MyArxiv
Robotics
Scalable Behavior Cloning with Open Data, Training, and Evaluation
We introduce ABC, a fully open-source stack for manipulation with behavior cloning. At its core is ABC-130K: the largest open-source teleoperation dataset to date, featuring 3,500 hours of data spanning over 130K episodes across 195 diverse tasks. Furthermore, we open-source our accessible hardware setup, training infrastructure, and simulation pipeline. We also release 400 hours of sim-teleop data and provide a co-training recipe that produces correlated simulation and real-world evaluation, offering a reliable proxy for ablating model-design and training decisions before costly real-world evaluation. We explore various training recipes and compare common architectural choices for Diffusion Transformers (DiT) and Vision-Language-Action (VLA) models, grounding our findings in real-world evaluations. The resulting policies successfully execute dexterous tasks such as box folding and extracting credit cards from wallets. By providing a reproducible toolkit, we aim to place researchers on an equal footing, establishing the necessary foundation to learn the ABCs of Behavior Cloning together as a community.
comment: 30 pages. Project page: https://abc.bot
World Action Models Enable Continual Imitation Learning with Recurrent Generative Replays
Going beyond predicting robot actions, World Action Models (WAMs) can also generate future visual observations. We build on this generative capability to propose Recurrent Generative Replay (REGEN), a continual imitation learning framework that synthesizes pseudo-replay trajectories, enabling a robot policy to rehearse previously learned tasks without storing their original human demonstrations. During continual adaptation, REGEN recursively queries the WAM to synthesize pseudo-replay trajectories conditioned only on prior task instructions and current-task observations. Experiments in both simulation and real-world manipulation settings show that REGEN reduces catastrophic forgetting by up to $50\%$ relative to sequential fine-tuning, while approaching the performance of privileged experience replay methods that require access to real replay data. Finally, we analyze the factors limiting generated replay, identifying long-horizon visual degradation and action-observation inconsistency as the primary bottlenecks. Our results establish WAMs as a promising foundation for continual robot learning without stored demonstrations.
RouterVLA: Turning Smoke Tests into Supervision for Heterogeneous VLA Selection
We study whether pre-deployment evaluation rollouts can be reused to supervise policy selection. Robot teams routinely smoke test candidate vision-language-action (VLA) policies, then compress those trials into a global winner. RouterVLA evaluates this idea with outcome-disjoint cross-fitting: recorded probes build a profile for each frozen expert, and a separate trial scores the selected expert without entering its profile. Across 34,752 LIBERO-Plus rollout records, a transparent probe-success rule raises held-out success from 0.4686 to 0.6149, a +14.64pp gain. Under the scalar-only profiles studied here, learned scorers are statistically indistinguishable from this rule, showing that commissioning carries the routing value while extra scalar scorer capacity does not create it. Reusing the scored trial inflates the measured gain by $1.87\times$, so credible ledger routing needs outcome separation; model scaling improves individual policies, while commissioning-aware routing improves the system built from them.
Continual Robot Policy Learning via Variational Neural Dynamics
Robots deployed in the real world rarely operate under a single fixed dynamics model: wind changes, payloads vary, batteries drain, contacts shift, and hardware wears. Yet most learning-based controllers are trained once and deployed as if learning were complete. This prevents the robot from using deployment experience to further improve task performance. In this work, we propose a continual learning framework that uses real-world experience to improve robot policies under hidden and recurring dynamics. Our method learns a condition-aware dynamics model from real state-action trajectories by combining an analytical physics prior with a neural residual for unmodeled effects. A recurrent encoder infers the current hidden condition from recent interaction, and this estimate conditions both the residual model and the policy. Policy learning is performed via differentiable simulation using diverse learned dynamics sampled from the latent model. At deployment, these sampled conditions are replaced by conditions inferred online from recent real interaction, allowing the policy to recover recurring dynamics by recognition rather than residual re-fitting. Through extensive simulation studies and real-world experiments, we demonstrate that the framework improves policy performance under diverse unobserved disturbances. On real quadrotor trajectory tracking under changing wind, the policy recovers from recurring disturbances in roughly 1s, about 5x faster than online residual re-fitting. It also reduces large-disturbance hover and tracking errors by 65.7% and 53.3% over the state-of-the-art online adaptation approaches
Bridging Performance and Generalization in Reinforcement Learning for Agile Flight
Autonomous drone racing is a fundamentally challenging regime for autonomous aerial robots, requiring time-optimal control while operating under persistent actuation saturation. While reinforcement learning (RL) has achieved human-level performance in this domain, current methods fail to generalize; policies trained on specific environments often crash immediately in unseen configurations. This failure reflects the intrinsic difficulty of zero-shot generalization in agile flight, arising from high-dimensional task variation and the tight coupling between safety and performance at high speeds. Existing approaches that improve generalization impose a substantial cost on flight speed: control policies must significantly degrade performance to achieve even modest levels of generalization. In this work, we propose a framework for zero-shot generalization in agile flight for RL-based drone racing. By combining task-aware switching based on learning progress with a physically informed procedural track generator, the framework produces a fast and robust generalist policy without test-time adaptation. Our method achieves strong zero-shot performance across a wide range of unseen racetracks in the real world, demonstrating a 7.4x improvement in generalization over the state-of-the-art approaches, while maintaining competitive racing speeds. We validate our method's results in both simulation and real-world settings, including a challenging vision-based, end-to-end control setting that operates without explicit state estimation, where all prior approaches fail to generalize.
VibeAct: Vibration to Actions for Contact-Rich Reactive Robot Dexterity
Dexterous manipulation depends on contact events that are fast, local, and often visually occluded. Piezoelectric microphones offer a compact and high-bandwidth way to sense these interactions, but the resulting vibro-acoustic signals are difficult to simulate faithfully enough for end-to-end sim-to-real policy learning on dexterous robot hands. We propose VibeAct, a framework that bridges real vibrotactile sensing and simulation-based reinforcement learning through a shared physical representation of contact and slip. In the real world, we embed piezoelectric microphones into a dexterous robot hand and collect vibro-acoustic data through teleoperation, then replay the recordings in a calibrated digital clone to automatically label per-finger contact and slip. A tactile estimator learns to predict contact and slip from real microphone waveforms, while manipulation policies are trained in simulation on the same representation computed directly from simulated contacts. This decoupling lets policies exploit rapid tactile feedback without simulating raw audio. Across five contact-rich tasks spanning regrasping, in-hand reorientation, and insertion, VibeAct consistently outperforms a proprioception-and-point-cloud baseline in simulation, with the largest gains on tasks requiring sustained reactive control, where the continuous slip-magnitude channel proves the most informative observation. The learned policies transfer to a physical dexterous hand-arm platform, improving success rates on deployed tasks. Project videos and additional details are at https://vibeact.github.io/.
Hallucination in World Models is Predictable and Preventable
Modern generative world models render increasingly realistic action-controllable futures, yet they frequently hallucinate: rollouts remain visually fluent while drifting from the ground-truth dynamics. We hypothesize that hallucination concentrates in low-coverage regions of the state-action space, where lightweight data-centric signals can both detect it and guide mitigation. To test this, we introduce MMBench2, a 427-hour, 210-task dataset for visual world modeling with ground-truth actions, rewards, and live simulators, and train a 350M-parameter world model on it. We identify three distinct hallucination modes: perceptual, action-marginalized, and scene-diverging -- each anchored to a different stage of the pipeline, and develop three signals that accurately predict where the model will fail. To close coverage gaps at training time, we develop a coverage-aware sampling technique; to close them online, our hallucination predictors serve as curiosity rewards for targeted data collection, yielding a data-efficient finetuning recipe that adapts the pretrained world model to entirely unseen environments with as few as 50 real environment trajectories. Overall, our findings reveal that hallucination in world models is inherently a data coverage issue, and that the same signals used to detect it can also be used for mitigation. An interactive web version of our paper is available at https://www.nicklashansen.com/mmbench2
comment: Interactive paper, live demo, code, dataset, and models: https://www.nicklashansen.com/mmbench2
OctoSense: Self-Supervised Learning for Multimodal Robot Perception
We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and joint angles for a quadruped robot). The eponymous OctoSense dataset contains 59 hours of time-synchronized driving data across different types of environments at different times of the day, including situations with highly degraded sensors. We demonstrate multi-modal self-supervised learning using such real-world robotics data, where sensors have different representations, frequencies, latencies and noise. Our approach, a "late-fusion" masked autoencoder, (i) uses modality-specific tokenizers to account for different spatiotemporal characteristics of these sensors, and (ii) caches modality-specific tokens at inference time to process new measurements as they come. This architecture (i) is fast (6.68 ms and 112 ms on NVIDIA 5090 and Orin NX respectively, to compute the representation), (ii) performs better than existing image-only foundation models on tasks such as estimation of optical flow, depth, semantic segmentation, and ego-motion (translation, rotation, and steering angle), and (iii) predicts robustly at nighttime or in situations where sensory data is degraded. See our project page for links to the dataset, code, and supplementary videos: https://abisulco.com/octosense/.
LA4VLA: Learning to Act without Seeing via Language-Action Pretraining
Vision-Language-Action (VLA) models are commonly pretrained on robot demonstrations by jointly mapping visual observations and language instructions to actions. However, dense visual-action supervision can dominate the comparatively sparse language-action signal. As a result, policies may rely on visual shortcuts rather than learn how language conditions action execution, making them sensitive to visual variations. To address this limitation, we propose LA4VLA, a language-action pretraining framework that enables policies to acquire language-conditioned action priors without visual observations. These priors capture reusable manipulation skills shared across tasks and scenes, reducing reliance on scene-specific visual cues. Specifically, LA4VLA decomposes expert demonstration trajectories into atomic action segments and pairs each segment with a corresponding low-level action description. This yields LA4-33K, a dataset of 33K Language-Action (LA) episodes derived entirely from existing demonstrations without additional robot data collection. We further develop LA4VLA-1B, a lightweight 1B-parameter VLA model, and investigate three paradigms for incorporating language-action supervision into VLA learning: LA-only pretraining, sequential LA-to-VLA pretraining, and mixed LA-VLA pretraining. Across simulation and real-world tasks, LA-pretrained policies consistently outperform matched VLA-pretrained counterparts, while combining LA and VLA supervision leads to further gains. In particular, mixed LA-VLA pretraining improves the average success rate of LA4VLA-1B over the no-pretraining baseline by up to 17.8 and 45.0 percentage points in simulation and real-world tasks, respectively. These results establish LA4VLA as an effective and complementary pretraining strategy for building stronger and more robust VLA policies.
comment: Github: https://github.com/MINT-SJTU/LA4VLA
BOWConnect: Parallel Bayesian Optimization over Windows with Learned Local Cost Maps for Sample-Efficient Kinodynamic Motion Planning IROS 2026
This paper presents BOWConnect, a bidirectional parallel kinodynamic motion planner that addresses three fundamental limitations of existing sampling-based methods: sample inefficiency in high-dimensional state spaces, unreliable cost heuristics under dynamic constraints, and poor performance in narrow passage environments. Unlike classical planners that rely on random control sampling and geometric distance heuristics, BOWConnect integrates Bayesian Optimization over Windows (BOW) as a learning-based steering function within a parallel tree-based exploration framework, enabling each worker to learn local cost maps and constraints to guide sampling toward dynamically feasible and collision-free controls. A bidirectional architecture simultaneously grows forward and backward trees from the start and goal regions in parallel threads, with a spatial hashing mechanism enabling fast connection queries and a boundary value problem solver generating kinodynamically consistent bridge trajectories. Extensive evaluations across ten benchmark environments demonstrate that BOWConnect achieves 100\% success while delivering the fastest or near-fastest planning time in complex scenarios, including narrow passages and non-convex spaces where state-of-the-art planners fail or degrade substantially. Real-world deployment on a ground vehicle and a quadrotor confirms real-time planning with no collisions. Videos of real-world and simulated experiments, high-resolution versions of the figures, and the open-source code are available at https://bow-connect.github.io/.
comment: Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation ECCV 2026
Recently, a few works have made early attempts to study test-time scaling for embodied tasks. However, two major challenges remain unsolved: (1) reasoning can effectively improve the performance of the policy, but its scaling mechanism has seldom been studied; (2) historical information is essential, as embodied tasks are inherently long-horizon and sequential, making sole reliance on current observations for action scaling inadequate due to the lack of historical context utilization. To address these challenges, we introduce E-TTS, a modular and plug-and-play Embodied Test-Time Scaling framework that unifies reasoning and action scaling for robotic manipulation via history-aware iterative refinement with vision-language verifiers. To support joint reasoning-action scaling, E-TTS performs reasoning-action joint sampling and scoring in a pairwise manner. To better utilize historical information, E-TTS uses a history buffer to store historical context, which is then used by reasoning and action verifiers to evaluate the sampled candidates. Unlike conventional open-loop TTS methods, E-TTS introduces feedback generation into the sampling process to form a closed-loop iterative refinement mechanism, enhancing both inference efficiency and environmental adaptability. Each component functions as an independent and composable module, allowing flexible and adaptive configuration depending on task requirements. To evaluate the advantages of our framework, we conduct experiments across 4 different benchmarks, 6 environments, 3 embodiments, and 4 base vision-language-action models. The experimental results demonstrate that, without requiring additional expert data collection or retraining, E-TTS consistently improves performance, achieving up to a 33.14% increase in simulation and 26.62% in real-world scenarios.
comment: Accepted to ECCV 2026. 44 pages, 11 figures. Project page: https://27yw.github.io/E-TTS-Web/
Advancing Omnimodal Embodied Agents from Isolated Skills to Everyday Physical Autonomy
Building persistent embodied agents in unstructured environments demands unified orchestration of heterogeneous tools spanning both cyber (APIs, IoT) and physical (manipulation, navigation) domains, coupled with autonomous recovery from physical failures that inevitably arise over extended operation. Existing systems treat these as separate problems: VLM-based planners lack a unified cyber-physical action space, agent frameworks accumulate unbounded context that degrades temporal coherence, and VLA policies execute open-loop without detecting their own failures. We argue that persistent autonomy requires not a monolithic model but a hierarchical asynchronous architecture with explicit separation of planning, memory, and verification. To this end, we present OmniAct, a framework integrating a multimodal semantic planner for skill routing across unified action spaces, an adaptive hierarchical memory with event-boundary-driven compression for sub-linear context growth, and an asynchronous visual preemption engine that closes the semantic loop during physical execution. Across 40 real-world long-horizon tasks on two robotic platforms coordinating four IoT devices, OmniAct achieves consistent improvements in end-to-end success across all complexity levels, maintains near-flat token consumption over under 100k+ accumulated interaction tokens, and elevates mid-scale open-weight models to proprietary-level performance.
HumanoidUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation
High-quality demonstration data are essential for humanoid robot skill learning, especially for whole-body behaviors that require coordinated perception, locomotion, and manipulation. Existing data-collection methods largely rely on robot teleoperation, which is constrained by hardware accessibility, operator expertise, and limited efficiency. Inspired by the Universal Manipulation Interface (UMI), we propose HumanoidUMI, a portable and robot-free framework for humanoid whole-body data collection. HumanoidUMI uses lightweight VR devices and UMI-inspired grippers to collect sparse human keypoint trajectories, wrist-view observations, and gripper actions. These demonstrations train a high-level policy to predict future keypoints, which are retargeted to robot-native whole-body references and executed by a whole-body controller. Experiments in five real-world scenarios demonstrate the effectiveness of the proposed framework and validate the collected demonstrations for transferable humanoid whole-body skill learning.
comment: 8 pages, 7 figures
Automating Potential-based Reward Shaping with Vision Language Model Guidance
Sparse rewards are inherently challenging for reinforcement learning agents as they lack intermediate feedback to guide exploration and to correctly attribute the sparse success rewards to relevant parts of the trajectory. Naive reward shaping can induce reward hacking, yielding policies that exploit auxiliary signals instead of solving the intended task. Potential-based reward shaping (PBRS) guarantees preservation of the optimal policy set, but requires the definition of a heuristic potential function over the state space. In this work, we introduce the VLM-guided PBRS framework VLM-PBRS that learns the potential function directly from vision language model (VLM) feedback. We query a lightweight VLM to obtain preferences over image pairs and train a model of the potential function using these preferences. As this approach is based on potential-based reward shaping, it preserves the original optimal policies, and removes the need for expert-designed reward shaping terms. Because large VLMs are prohibitively expensive to invoke repeatedly during policy learning, we employ smaller, more computationally efficient VLMs. Although the resulting preference labels are less accurate, empirical evidence shows that the preference labels can still be used to accelerate learning. We validate our method empirically in the Meta-World and Franka Kitchen environments and highlight the connection between VLM preference label accuracy and sample efficiency improvements. Our contributions are threefold: (1) the first application of VLM preference-based learning to synthesize a potential function for PBRS, (2) a principled, low-cost solution that leverages small VLMs, and (3) extensive empirical demonstration of improved sample efficiency and robustness to reward hacking.
Learning to Fold: prizewinning solution at LeHome Challenge 2026 (1st place online, 2nd offline) ICRA 2026
I describe my solution to the LeHome Challenge 2026, an ICRA 2026 competition on bimanual garment folding. The system placed 1st of 62 teams in the online (simulation) round and 2nd in the real-world final. It improves a vision-language-action (VLA) policy with a reinforcement-learning loop. The policy is its own value function: the same network that predicts actions also predicts success, progress, and a few task-relevant future quantities, and those predictions drive advantage estimation, live failure detection, and candidate selection. The work mostly recombines existing RL ideas with engineering and optimization contributions that can be used together as one recipe or individually: AWR + RECAP combined for flow-matching VLA; an asynchronous distributed training / rollout pipeline through HuggingFace Hub; inference-time hyperparameters optimization via Thompson sampling; a sim-to-real recipe with camera-alignment tooling, heavy augmentation and DAgger-like HIL data collection.
comment: Solution of the LeHome Challenge at ICRA 2026
PhysReflect-VLA: Physical Feasibility and Self-Reflective Regulation for Reliable Vision-Language-Action Policies
Long-horizon robotic manipulation is highly sensitive to physically infeasible transitions, contact-induced disturbances, and the lack of effective self-correction during execution. Although Vision-Language-Action (VLA) models provide strong task grounding through multimodal learning, they typically generate actions in a feed-forward manner without explicitly checking physical feasibility or diagnosing execution errors online. We present PhysReflect-VLA, a plug-and-play execution-time reliability framework that augments VLA policies with physical feasibility evaluation and structured self-reflection in a closed-loop control pipeline. A Feasibility Operator evaluates whether candidate actions induce dynamically consistent state transitions; an Action Explanation Operator verifies transition coherence; and an LLM-based Reflection Module analyzes state discrepancies to generate corrective guidance for subsequent actions. A two-stage training procedure stabilizes feasibility modeling and integrates reflection into the control loop. Experiments on multi-stage, contact-rich real-world manipulation tasks show consistent improvements in stage-wise stability and overall task success compared with representative VLA baselines with an average gain of 5.4\%. Ablation results further indicate that feasibility checking and reflection-based correction both contribute to improved execution robustness. These results highlight the importance of embedding physical consistency checks and online self-reflection for reliable long-horizon robotic manipulation.
PAMAE: Phase-Aware-MoE Action Experts Towards Reliable Flow-Matching Vision-Language-Action Policies
Reliable action generation for multi-stage robotic manipulation remains challenging for Vision-Language-Action (VLA) models. While existing flow-matching VLA policies offer strong multimodal grounding and generalization, they typically employ a single shared action expert, limiting their ability to capture phase-specific control patterns across distinct execution stages. We propose a plug-and-play Phase-Aware Mixture-of-Experts Action Module (PAMAE), as a step towards more reliable phase-consistent action generation. PAMAE replaces the original flow-matching action expert with a sparse expert mixture while preserving the pretrained VLA backbone. PAMAE introduces a phase-aware router that leverages execution-phase cues to allocate action generation across experts, supported by a lightweight phase prediction head and a routing alignment objective. To stabilize specialization, we adopt a two-stage training scheme that first warms up the expert module under the standard flow-matching loss and then optimizes phase-consistent routing under auxiliary supervision. On multi-stage manipulation simulation tasks, PAMAE improves task success by up to \textbf{9.2\%} over strong VLA baselines. Further ablations show that both phase-supervised routing and staged optimization are essential for the observed gains. Our results highlight phase-consistent expert allocation as an effective mechanism for improving the reliability and action quality of flow-matching VLA policies.
FlameVQA: A Physically-Grounded UAV Wildfire VQA Benchmark with Radiometric Thermal Supervision
Wildfire monitoring from UAVs requires reliable reasoning over complex aerial scenes, where smoke, scale variation, and occlusions often limit RGB-only interpretation. We introduce FlameVQA, a multiple-choice visual question answering benchmark for UAV-based wildfire intelligence built on FLAME 3, leveraging paired RGB imagery and radiometric thermal TIFFs for temperature-grounded, safety-critical reasoning. FlameVQA includes 34 multiple-choice questions per image spanning six operational capability groups, covering tasks such as detection, localization, distribution/coverage estimation, cross-modal reasoning, and flight planning. To ensure label reliability, we combine MLLM-assisted annotation with deterministic thermal rules and cross-question consistency checks, followed by human auditing. We also evaluate representative MLLMs on FlameVQA to provide baselines for future work. Results show strong performance when explicit cross-modal cues are available, but notable failures on presence detection under heavy smoke and on coverage estimation. These findings suggest that current MLLMs require domain-specific adaptation to better support disaster and wildfire monitoring. The dataset and benchmark code are open-source at github.com/mobiiin/WildFire_VQA
Proposal-Conditioned Latent Diffusion for Closed-Loop Traffic Scenario Generation SC
Closed-loop traffic simulation remains challenging because it must generate interactive multi-agent behaviors that are scene-consistent and controllable throughout rollout. Prior diffusion-based approaches achieve strong realism, but their computational cost can hinder deployment in time-constrained replanning loops for autonomous vehicle planning and simulation. We present a diffusion-based scenario generation framework conditioned on instance-centric scene context and multimodal proposal priors, with optional test-time guidance for shaping safety-critical behaviors. A compact action-latent representation and proposal-based initialization improve sampling efficiency and reduce per-step runtime without retraining. Experiments on the Waymo Open Motion Dataset demonstrate a favorable balance among realism, safety, and controllability across diverse interactive scenarios, while showing that test-time guidance enables systematic trade-offs among competing objectives.
comment: Accepted for publication at the IEEE International Conference on Intelligent Transportation Systems (ITSC), 2026
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models IROS 2026
In embodied intelligence, safety is a prerequisite for reliable robot deployment in the physical world. Current vision-language-action (VLA) models continue to advance toward general-purpose task capability, yet their embodied safety limits remain poorly understood. To address this gap, we introduce ForesightSafety-VLA, a diagnostic benchmark that makes safety the primary evaluation target for VLA systems. We define a 13-category safety taxonomy covering physical interaction safety (Safe-Core), instruction-side safety (Safe-Lang), and perception-side safety (Safe-Vis), and evaluate policies under three controlled dimensions of variation -- scene structure, language command, and visual observation -- so that failure sources can be diagnosed rather than hidden in a single aggregate score. Beyond binary task success, ForesightSafety-VLA measures process-level risk through cumulative safety cost (CC) and risk exposure time (RET), together with a four-quadrant decomposition of safe/unsafe success and failure. We instantiate 66 safety-augmented base scenarios in RoboTwin across 5 embodiments and report results on representative VLA baselines. Across the evaluated baselines, even the strongest policy incurs non-trivial safety cost and unsafe nominal success, while structure and visual variation induce substantially stronger safety degradation than ordinary language variation. These results suggest that embodied safety is tightly coupled to perception, grounding, and control competence rather than being reducible to post-hoc safety filtering alone.
comment: 8 pages, 5 figures, 4 tables. Submitted to IROS 2026
RelAfford6D: Relational 6D Affordance Graphs for Constraint-Driven Robotic Manipulation
Bridging abstract semantics and precise physical control remains a fundamental challenge in open-world robotic manipulation. While recent data-driven policies show promise, their reliance on isolated contact points or latent affordance embeddings lacks the rigorous kinematic constraints necessary for complex articulated objects.To overcome the limitation, we introduce RelAfford6D, a novel training-free framework centered on a Relational 6D Affordance Graph. Given a free-form instruction, our system deduces a semantic topology linking a primary interacting part to its physical anchor. By elevating these topological nodes into precise metric $SE(3)$ poses via vision foundation models, we analytically formulate downstream execution as a kinematic constraint satisfaction problem. The robot synthesizes continuous trajectories by tracking strictly defined physical manifolds (e.g., revolute or prismatic orbits). Coupled with a closed-loop tracking mechanism for dynamic replanning against disturbances, our physically grounded approach achieves superior zero-shot success rates, cross-category generalization and execution robustness in both simulation and the real world environments, outperforming existing data-driven baselines.
In-Context Model Predictive Generation: Open-Vocabulary Motion Synthesis from Language Models to Physics
Synthesizing human motion from textual descriptions is essential for immersive digital applications, yet existing methods face a persistent trade-off between semantic fidelity and physical realism. Large language model (LLM)-based approaches can interpret diverse open-vocabulary instructions and compose high-level action plans, but they often generate motions that violate physical constraints. Physics-aware models improve realism through simulation or control, but they struggle with semantic complexity, fine-grained instructions, and novel concepts. To address this gap, we propose In-Context Model Predictive Generation (ICMPG), a framework that integrates language-model planning with inference-time physical feedback. ICMPG reformulates motion synthesis as a Model Predictive Control (MPC)-like process with two modules. The Context-Aware Motion Generation (CAMG) module uses an LLM as a planner to decompose textual commands and generate candidate motion sequences from motion tokens. The Model Predictive Generation (MPG) module evaluates these candidates through physical simulation and semantic alignment, estimates a composite reward, and selects the best sequence to guide subsequent generation steps. Unlike open-loop generation, this closed-loop refinement enables ICMPG to adapt motions to both the input semantics and the simulated physical environment without task-specific policy retraining. Extensive experiments across standard and zero-shot open-vocabulary settings show that ICMPG generalizes robustly to diverse commands and produces motions that are more physically plausible and semantically faithful than representative baselines on the evaluated benchmarks. The framework bridges semantic interpretation and physical simulation while remaining flexible enough to incorporate different LLM backbones, enabling more versatile and controllable text-driven motion synthesis.
RobOralScan: Learning Active Intraoral Scanning for Robotic Dental Reconstruction
Intraoral scanning is widely used for digital optical impressions in prosthodontic, implant, and orthodontic treatment, but full-arch and long-span scanning remain labor-intensive tasks with limited automation. In the confined oral cavity, operators must continuously adjust scanner motion while accumulating narrow field-of-view observations, making reconstruction quality sensitive to missing tooth surfaces and operator workload. We propose RobOralScan, which, to the best of our knowledge, is the first reinforcement learning (RL)-based pipeline for robotic automatic intraoral scanning. RobOralScan introduces a geometric memory-based observation space that accumulates partial scan observations into a tri-state geometric representation, allowing the policy to reason over scan history and insufficiently observed regions. It further introduces tooth-wise coverage learning, combining coverage-aware reward signals and a progressive training scheme to improve global reconstruction coverage while reducing uneven coverage across individual teeth. The learned policy selects relative scanner motions from accumulated geometric memory and robot proprioception for closed-loop scan control within the oral workspace. RobOralScan achieves a Chamfer Distance of 0.00838, an average coverage of 92.58%, a lower-tail per-tooth coverage of 88.45%, and a normalized AUC of 0.6674, completing the scan criterion in 8 of 10 evaluation episodes. Furthermore, zero-shot sim-to-real experiments demonstrate its practical feasibility on a physical robot-scanner setup.
comment: 24 pages, including supplementary material
UAV-MapFusion: RTK-Aligned Uncertainty-Aware Coarse-to-Fine Multi-Session UAV Mapping
Large-scale point cloud maps are essential for robotics and spatial intelligence tasks. UAVs provide an efficient means for large-scale map acquisition; however, due to limited flight endurance and onboard storage, mapping a large-scale scene within a single flight remains difficult. Existing multi-session map merging methods can extend the mapping range, yet in UAV scenarios they still struggle to simultaneously suppress long-range drift and preserve local geometric accuracy. To address this issue, an uncertainty-aware multi-session point cloud map merging and coarse-to-fine optimization system is proposed. The proposed method first performs initial multi-session map merging based on a scene graph, and then incorporates RTK observations through an RTK spatiotemporal alignment module, where temporal offsets are estimated using Dynamic Time Warping (DTW), and continuous RTK constraints are recovered using Multi-Output Gaussian Processes (MOGP) under incomplete sampling and frame dropouts. On this basis, a unified uncertainty-aware factor graph is constructed, and local geometric accuracy is further improved through iterative plane-factor refinement. Experiments on real-world datasets validate the effectiveness and robustness of the proposed method. To facilitate further research and development in the community, our code and dataset will be publicly released.
comment: 8 pages, 5 figures, accepted by IEEE Robotics and Automation Letters (RA-L)
Risk-Aware Selective Multimodal Driver Monitoring with Driver-State World Modeling
Continuous driver monitoring in automated vehicles requires low-latency inference while avoiding unsafe decisions under uncertain driver states. Large vision-language models provide broad multimodal priors, but their latency and limited reliability in this setting make them unsuitable as always-on in-cabin monitors. We propose a cost-aware selective inference framework for deployable multimodal driver monitoring. The core system is a lightweight RGB-physiological student that combines in-cabin visual observations with window-level HR/EDA signals, and a learned gate that decides when to accept the fast prediction or abstain for safety intervention. Additional controls show that the learned scores contain sample-level information beyond scenario priors, while exact physiological synchronization remains a limitation. To incorporate predictive evidence, we further study a compact driver-state world modeling module that rolls out latent driver-state features and estimates future fast-model errors and counterfactual system-level action costs. On scenario-induced driver-demand recognition, the RGB-physiological student improves over RGB-only and physiology-only baselines, reaching 0.7440 Macro-F1 and 0.9099 balanced accuracy with 11.39M parameters and 3.08ms inference latency. Cost-aware selective inference reduces unsafe false negatives from 17.37% under always-fast inference to approximately 5% across seeds, while maintaining deployment-level latency. While driver-state world modeling offers valuable predictive signals, worst-group evaluations highlight persistent operating-point calibration drift. Ultimately, reliable edge driver monitoring requires advancing not only perception backbones, but also risk-aware selective control and group-robust calibration.
PlanRL: A Trajectory Planning Architecture for Reinforcement Learning-based Driving Experts IROS 2026
Reinforcement learning (RL) has become a prominent framework for developing driving experts in autonomous vehicles. However, most existing RL-based experts are designed to output direct control commands (e.g., throttle, steering), which suffer from a lack of interpretability, high spatial complexity in learning road geometries, and poor compatibility with modern end-to-end planning architectures. To address these limitations, we propose a novel trajectory planning architecture for RL driving experts that integrates an RL policy with a polynomial-based trajectory planner. By employing a Frenet-frame coordinate system, our method simplifies complex road geometries into a curvilinear framework, offering a structured coordinate prior that facilitates policy learning. Furthermore, we incorporate a kinematic feasibility check into the planning stage to ensure that generated trajectories remain within the vehicle's physical limits, effectively mitigating cumulative tracking errors typically found in planning-based systems. We evaluate our approach on key CARLA benchmarks, where it significantly outperforms existing state-of-the-art control-based RL experts. On the CARLA Offline Leaderboard v1 and NoCrash benchmarks, our method improves the driving score by 5% and 11%, respectively, and increases the success rate by 8% and 19%.
comment: Accepted at IROS 2026
Humanoid-DART: Humanoid Loco-Manipulation using Diffusion-guided Augmentation through Relabeling and Tracking
Imitating human demonstrations has emerged as a dominant paradigm for learning humanoid loco-manipulation policies. However, scaling these approaches remains challenging due to the high cost of collecting diverse demonstrations and the need for continual human intervention to correct policy failures. In this paper, we present a self-supervised framework that bootstraps from sparse demonstrations and progressively expands its behavioral repertoire, enabling the learning of a goal-conditioned policy that automatically explores the goal space with minimal expert supervision. Our approach combines diffusion-based trajectory generation with reinforcement learning, where the latter is used to track goal-conditioned trajectories produced by the diffusion model for a range of loco-manipulation skills. Through extensive ablation studies and comparisons with state-of-the-art methods, we demonstrate the effectiveness of our framework on multiple humanoid loco-manipulation skills.
Ordinal Neural Collapse as a Representation Prior for Visual Navigation
Learning robust navigation policies directly from visual observations remains a fundamental challenge in vision-based robotic navigation. In end-to-end imitation learning approaches, the visual encoder and action decoder are jointly optimized using a single action loss, which provides only an indirect supervisory signal to the encoder. This indirect supervision frequently results in the encoder learning ambiguous, action-agnostic representations. The problem is further complicated by substantial variations in scene structure and appearance across diverse environments, as well as the prevalence of visual distractors inherent to real-world navigation settings. Such action-agnostic features cause the navigation policy to produce inconsistent actions at ambiguous decision points, leading to navigation failure. To overcome these limitations, we propose ORION (Ordinal Neural Collapse for Visual Navigation), a method that explicitly organizes the encoder's representation space according to the ordinal structure of navigation actions. In the context of goal-directed navigation, ego-centric control categories from Far Left to Far Right exhibit a natural ordinal relationship in which neighboring classes share similar visual contexts, while semantically opposing classes differ substantially in appearance. We encourage class representations to be arranged sequentially along a single discriminative axis, while suppressing off-axis variance within each class. The pretrained encoder is then integrated into a diffusion-based navigation framework, and the full pipeline is fine-tuned end-to-end. Extensive experiments in both simulation and real-world settings show that ORION consistently outperforms end-to-end and neural collapse baselines in navigation success rate and goal progress, with notable gains in visually challenging scenarios such as complex multi-way intersections.
comment: 27 pages, 14 figures. Supplementary material included
Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision
Vision-Language-Action (VLA) models have shown strong potential for generalizable robotic manipulation. During fine-tuning, however, action supervision applies equally across all timesteps, without structured supervision on which manipulation stage the robot is in or what the next gripper-event target should be. This causes failures to concentrate around challenging gripper-event transitions. To address this, we propose StaKe, a plug-in auxiliary supervision framework that automatically derives two complementary signals from demonstration gripper states without manual annotation: a stage classifier that identifies the current manipulation stage, and a keyframe predictor that estimates the target joint action at the next gripper transition. Both are modeled as lightweight auxiliary heads that enrich the learned representations during training, while leaving the base VLA policy architecture and inference loop unchanged. Experiments on bimanual simulation and single-arm Franka real-robot tasks show that StaKe consistently improves success rates (relative gains of 14% and 56%, respectively), with larger improvements on longer-horizon tasks that involve more gripper-event transitions. Ablation studies validate each design choice, and qualitative analysis confirms that the learned representations faithfully track manipulation stages. These results indicate that structured supervision is an effective and general strategy for enhancing VLA fine-tuning in long-horizon manipulation. Project website: https://hi-yuanxu.github.io/StaKe-Web/
SSI-Policy: Learning Structured Scene Interfaces for Vision-Language Robotic Manipulation IROS
Real-world robotic manipulation demands spatial grounding, task-aware reasoning, and precise control. Learning such capabilities becomes particularly challenging in the low-data regime. Prior methods often trade off scalable task-level reasoning and explicit physical structure: video-based approaches can drift geometrically over long horizons, 3D approaches often require depth sensing, and many flow/trajectory interfaces emphasize motion without an explicit RGB-only geometric representation. We introduce SSI-Policy, a modular framework built around a Structured Scene Interface (SSI) -- a unified, RGB-only intermediate representation that jointly encodes monocular depth features, language-grounded object layouts, and instruction-conditioned 2D motion trajectories. Critically, SSI is robot-agnostic and trainable from action-free video, decoupling perception from control so that the downstream policy can learn from few demonstrations. On the LIBERO benchmark with only 10 demonstrations per task, SSI-Policy improves over the strongest prior method by nearly 15\% and remains competitive with 50-demo methods that leverage large-scale external pretraining. Ablations show that geometric and motion cues provide complementary benefits within the shared interface. We further validate on 13 real-world tasks spanning spatial reasoning, cross-embodiment transfer, and contact-rich manipulation.
comment: Accepted by 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
PressMimic: Pressure-Guided Motion Capture and Control for Humanoid Robot Imitation
Humanoid motion imitation requires not only accurate perception of human kinematics but also faithful reproduction of physical interactions with the environment. However, existing pipelines rely primarily on vision-based motion capture and kinematic imitation, largely ignoring contact dynamics, leading to artifacts such as foot sliding, floor penetration, and unstable behaviors. In this work, we revisit humanoid motion imitation from the perspective of physical grounding and leverage pressure as a unified modality across perception and control. We present PressMimic, a framework that integrates pressure into the full pipeline from motion capture to humanoid control. In the perception stage, we introduce FRAPPE++, a multimodal model that fuses RGB and pressure to jointly estimate 3D pose and global motion, where pressure provides explicit contact and support constraints to resolve ambiguity in vision-based estimation. In the control stage, we propose a pressure-supervised policy (PSP) that incorporates pressure-derived signals into reinforcement learning, enabling physically consistent contact patterns during execution. We further construct MotionPRO, a large-scale dataset with synchronized RGB, pressure, and motion capture data. Experiments show that pressure improves motion estimation accuracy, trajectory consistency, and execution stability. These results demonstrate that pressure serves as an effective physical grounding signal, bridging perception and control for physically consistent humanoid motion imitation.
Learning Motion Feasibility from Point Clouds in Cluttered Environments
Motion feasibility prediction plays a central role in robotics, particularly in task and motion planning and manipulation. A major bottleneck for this problem in cluttered environments is that infeasible planning attempts by Sampling-based motion planners (SBMPs) can incur substantial computational cost. Also existing approaches for infeasibility certification are limited to low-dimensional configuration spaces and often assume simplified geometric environments represented by primitive objects with known parameters. We study the complementary problem of learning motion feasibility prediction directly from raw RGB-D observations for a 7-DOF manipulator operating in realistic cluttered scenes. We introduce the first large-scale benchmark for this setting, comprising 2.7M grasp feasibility labels over 88 scanned objects and 190 cluttered tabletop scenes. We benchmark three representative classifier families spanning MLP- based, volumetric-CNN, and point-cloud-based Transformer architectures under matched training conditions. Our best model, GRASPFC-PTX (a point-cloud transformer), achieves an AUROC of 0.996 on Novel objects while providing predictions significantly faster than SBMPs.
Tactile-WAM: Touch-Aware World Action Model with Tactile Asymmetric Attention
World Action Models (WAMs) generate actions together with predicted futures, offering a powerful interface for robot decision making. In contact-rich manipulation, however, visually plausible futures can be physically incomplete: insertion, assembly, search, and reorientation often depend on slip, jamming, contact normals, or small alignment errors that are weakly visible or hidden in RGB. A natural solution is to predict future tactile states, however, we identify tactile pollution, a failure mode where unconstrained tactile-token injection degrades video and action prediction by forcing a visual dynamics model to absorb sparse, local, event-driven contact signals. To address this, we propose Tactile-WAM, a touch-aware WAM with a Tactile Asymmetric Attention Mechanism (TAAM). TAAM combines a VideoClean mask, which blocks video-query access to tactile key/value tokens while preserving action-query access, with a touch-aware bias for action attention. The VideoClean mask protects visual prediction while keeping contact information available for action generation; the touch-aware bias is derived from predicted touch changes and modulates action attention to tactile tokens during denoising. On ManiFeel, Tactile-WAM improves the mean success rate by 38.9% overall and by 86% on contact-rich tasks.
comment: Submitted to RSS2026 WorkShop Tactile for FM
LAMP: Lane-Aligned Motion Primitives for Feasible Trajectory Prediction SC 2026
Motion forecasting is essential for autonomous driving systems to enable safe decision-making and planning in complex driving scenarios. While existing predictors excel at minimizing standard displacement errors, they often overlook the adherence to lane topology of multimodal predictions, particularly for lower-probability modes. Consequently, predicted trajectories may violate physical and logical constraints, making the prediction set unreliable for safety-critical planning. In this paper, we propose LAMP (Lane-Aligned Motion Primitives), a topology-aware forecasting framework that anchors multimodal prediction to structured motion primitives aligned with lane topology. Specifically, we use a VQ-VAE to learn shape-aware motion primitives as discrete intention queries, capturing spatiotemporal patterns beyond endpoint-based intentions. We further introduce a feasibility-aware intention selector trained with a lane-topology prior for filtering unreachable intention queries, guiding the decoder to prioritize topology-consistent intentions while preserving behavioral diversity. Extensive experiments on the Argoverse 2 dataset demonstrate that LAMP achieves prediction accuracy comparable to state-of-the-art baselines while outperforming them in feasibility and diversity metrics.
comment: IEEE ITSC 2026, 6 pages
Hardware Design for Table Tennis Robot Capable of Beating Professional Players
This paper focuses on the hardware specifications required for a table tennis robot to beat professional players. After analyzing the motions of elite players, we defined target specifications for the workspace, payload, external-force resistance, physical performance, serve capability, and end-effector accuracy. Based on these specifications, we developed "Ace", a custom 8-DoF robot. The mechanical structure was improved through topology optimization to minimize mass while preserving stiffness. Motor and gearbox selection was optimized using an inverse-dynamics torque model. Low-order per-joint dynamics models with delay compensation were identified and integrated into simulation to enable the use of an RL control policy. Experiments demonstrated repeated full-stroke swings with a cycle time of 0.8 s and a peak racket-center velocity of 22 m/s. The robot successfully defeated multiple professional players.
comment: 8 pages, 13 figures, 5 tables. The supplementary video can be downloaded from "Ancillary files"
A Closed-Form 4-DoF Inter-Robot Pose Estimator using Bearing-only Measurements
Bearing-odometry-based cooperative localization has attracted increasing research interest due to its minimal infrastructure requirements, low communication bandwidth and broad applicability in complex environments. However, existing 6-DoF approaches still face challenges in rapidly obtaining accurate and reliable inter-robot pose estimation, as the system is prone to observability degeneracy under specific motion patterns. To address these issues, we first propose a closed-form 4-DoF inter-robot pose estimator, which relaxes nonlinear constraints for rotations estimation and employs error projection for translations estimation. We then conduct a theoretical analysis of the system's observability, identifying degeneracy under two typical motion patterns: collinear and shape-preserving formations. The analysis further shows that the proposed 4-DoF system requires less stringent motion excitation for observability, enabling reliable estimation under a broader range of cooperative maneuvers. Furthermore, an observability test module is introduced to autonomously determine the optimal estimation instant, eliminating reliance on a predefined fixed-length sliding window. Extensive simulations and real-world experiments demonstrate that the proposed algorithm achieves higher estimation accuracy with significantly low computational cost, and the observability test module ensures estimation reliability while minimizing the data collection interval.
Bridging Handheld and Teleoperated Supervision for Contact-Rich Manipulation via State-Gated Experts
Handheld data collection systems, such as the Universal Manipulation Interface (UMI), enable scalable data collection across diverse environments but only capture observed actions rather than the desired actions executed by a robot controller. In contrast, teleoperation captures desired actions directly, but is prohibitively time-consuming to collect. We revisit this trade-off through the lens of action validity across task phases. We observe that handheld trajectories provide valid supervision in tolerant, free-space phases, but lack dynamic feasibility in contact-sensitive phases, where tracking observed trajectories at high stiffness produces large, unsafe contact forces. We study the interaction between these two supervision types for contact-rich manipulation and find that training policies that combine handheld data with a small number of targeted teleoperated demonstrations provide an efficient hybrid strategy. Specifically, rather than teleoperating the entire task, we only collect partial teleoperated demonstrations for task segments where base handheld policies fail. However, naively mixing handheld and teleoperated phase-specific data yields worse performance than training on handheld data alone. To address this mismatch between observed and desired supervision, we propose Bi-modal Routing for Imitation Data via Gated Experts (BRIDGE), a mixture of diffusion policy experts that routes between specialist task phase heads conditioned on the current robot state. Notably, our approach enables task-phase specific use of desired actions during contact sensitive segments and improves success rates over handheld-only baselines by up to 36.7% across three contact-rich manipulation tasks.
comment: Project Page: https://nperi-rai.github.io/bridge-project/
Inference-Time Robot Behavior Steering through Physically-Aware Reconfiguration of Task-Structure
A central challenge in deploying learned robot policies is inference-time behavior steering: redirecting a policy at test time to satisfy user preferences not anticipated during training, without retraining. Existing methods fail in two modes: end-to-end methods require fine-tuning or expert-level guidance, while neuro-symbolic methods rely on predefined symbols whose edits can result in logically reasonable but physically infeasible plans. To address this challenge, we propose ReStruct, which builds upon a neural automaton policy that decomposes a visuomotor policy into a high-level state-machine skeleton capturing task structure and a low-level continuous controller represented as a residual policy. Specifically, ReStruct adopts the automaton to represent the preference and incorporates it into the skeleton through a synchronous product, thereby reconfiguring the task structure. With the controller kept frozen, the action priors provided by the skeleton are updated accordingly to enable physically-aware control under a modified task structure. Extensive experiments from simulation and real-world show that ReStruct steers a wide range of preferences, from object-centric specifications to temporal-logic constraints, and after steering surpasses existing methods, exceeding VLA models in both task success and preference-following by up to 25%.
IDEA: Insensitive to Dynamics Mismatch via Effect Alignment for Sim-to-Real Transfer in Multi-Agent Control
Complex multi-agent control tasks remain challenging for traditional rule-based and model-based approaches, motivating the adoption of learning-based methods. However, learning-based methods often struggle with sim-to-real transfer because they rely on accurate dynamics modeling or system identification and learn policies in low-level control spaces that are highly sensitive to dynamics mismatch, making them costly and fragile in complex environments. To address this issue, we propose a sim-to-real method for multi-agent control, which is insensitive to dynamics mismatch via effect alignment. Our method combines random environmental structure with discrete semantic actions through closed-loop control, elevating policy learning to a semantic abstraction level. Additionally, we develop an action synchronization mechanism that mitigates inter-agent action timing mismatches, thereby enhancing the temporal consistency of the system. Experiments on four multi-agent navigation tasks demonstrate that our method substantially improves training efficiency over mainstream transfer methods and achieves higher success rates in real-world scenarios, thereby improving the robustness and deployment stability of multi-agent systems under dynamics mismatch.
comment: 8 pages, 6 figures
OSC2Runner: OpenSCENARIO 2.x Compliant High-Fidelity AV Simulation in CARLA
Scenario-Based Testing predominantly relies on the legacy ASAM OpenSCENARIO 1.x XML standard because existing continuous simulation frameworks lack native execution support for the recently matured v2.x Domain-Specific Language (DSL). Adapting legacy interpreters to evaluate v2.x logic introduces spatiotemporal drift, asynchronous event latencies, and artificial kinematic snapping. Addressing this execution gap, OSC2Runner introduces the first orchestration framework capable of natively mapping the OpenSCENARIO v2.x DSL to CARLA. The framework achieves this by formalizing scenario translation as a compilation pipeline through a multi-pass transpiler architecture. Bypassing static trajectory playback, the architecture synthesizes type-safe Abstract Syntax Trees directly into dynamic deterministic behavior trees (py_trees) natively mapped to CARLA's atomic APIs. Empirical validation in highly concurrent adversarial case studies demonstrates tick-by-tick determinism, exact spatial trigger evaluation, and 100.0 ms cross-actor blackboard synchronization. Kinematic analysis proves the strict adherence to continuous environmental boundaries. This architecture transitions Scenario-Based Testing from approximate behavioral interpretation to mathematically rigorous execution, establishing the deterministic backend required for co-simulation, hardware-in-the-loop testing, and automated LLM-driven generation pipelines.
comment: Accepted at 26th IEEE International Conference on Software Quality, Reliability, and Security (QRS 2026)
OmniRobotHome: A Multi-Camera Home Platform for Real-Time Human-Robot Interaction
Robots in homes must continuously sense the people around them, yet most prior work relies on limited or offline perception. We argue that perception quality is the dominant factor governing what interaction is achievable at home, and build a testbed to test this claim. OmniRobotHome instruments a furnished home with 48 hardware-synchronized cameras and three manipulators in a unified world frame, delivering real-time markerless full-body human pose, 6D object pose, anticipatory motion forecasting, and a social avatar agent that converses with residents. Using the platform, we treat perception quality as an experimental variable across safety, human assistance, and social interaction, and find that interaction quality degrades measurably as real-timeness, granularity, coverage, accuracy, forecasting, or memory is weakened. All code and data will be released.
comment: Project Page: https://junc0ng.github.io/omnirobothome
History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation IROS
Vision-Language Navigation (VLN) enables robots to follow natural-language instructions in visually grounded environments, serving as a key capability for embodied robotic systems. Recent Vision-Language-Action (VLA) models have demonstrated strong navigation performance, but their high computational cost introduces latency that limits real-time deployment. We propose a training-free spatio-temporal vision token pruning framework tailored to VLA-based VLN. We apply spatial token selection to the current view, alongside spatio-temporal compression for historical memories, enabling efficient long-horizon inference while reducing redundant computation. Leveraging attention-based token importance and query-guided spatio-temporal filtering, the proposed approach preserves navigation-relevant information without retraining or modifying pretrained models, allowing plug-and-play integration into existing VLA systems. Through experiments on standard VLN benchmarks, we confirm that our method significantly outperforms existing pruning strategies. It successfully preserves superior navigation accuracy under extreme pruning scenarios, all while maintaining the highly competitive inference efficiency. Real-world deployment on a Unitree Go2 quadruped robot further validates reliable and low-latency instruction-following navigation under practical robotic constraints. We hope this work helps bridge the gap between large-scale multimodal modeling and efficient, real-time embodied deployment in robotic navigation systems. Project Page: https://wqtwjt1996.github.io/publications/2026-vln.html
comment: International Conference on Intelligent Robots and Systems (IROS) 2026
GO: The Great Outdoors Multimodal Dataset
The Great Outdoors (GO) dataset is a multi-modal annotated data resource aimed at advancing ground robotics research in unstructured environments. Existing off-road datasets often lack sensor diversity and exclude vital modalities like thermal and radar that are critical for operation in degraded conditions (e.g., low visibility or adverse weather). To address these gaps, we introduce a large-scale multimodal off-road dataset with six complementary sensor modalities, along with semantic annotations and GPS traces, to support tasks such as semantic segmentation, object detection, and SLAM. The diverse environmental conditions represented in the dataset present significant real-world challenges, which provide opportunities to develop more robust solutions to support the continued advancement of field robotics, autonomous exploration, and perception systems in natural environments. The dataset can be downloaded at: https://www.unmannedlab.org/the-great-outdoors-dataset/
comment: 7 pages, 7 figures, accepted at IV 2026
Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models ICRA 2026
Vision-Language-Action (VLA) models such as OpenVLA, Octo, and $π_0$ have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of supervised data. Reinforcement learning (RL) provides a promising path for improving and fine-tuning VLAs through online interaction. However, conventional policy gradient methods are computationally infeasible in the context of flow-matching based models due to the intractability of the importance sampling process, which requires explicit computation of policy ratios. To overcome this limitation, we propose Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. Furthermore, FPO achieves stable and scalable online reinforcement fine-tuning of the $π_0$ model by integrating structure-aware credit assignment to enhance gradient efficiency, clipped surrogate objectives to stabilize optimization, multi-step latent exploration to encourage diverse policy updates, and a Q-ensemble mechanism to provide robust value estimation. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL, and $π_0$-FAST baselines, observing consistent improvements over the imitation prior and strong alternatives with stable learning under sparse rewards. In addition, ablation studies and analyses of the latent space dynamics further highlight the contributions of individual components within FPO, validating the effectiveness of the proposed computational modules and the stable convergence of the conditional flow-matching objective during online RL.
comment: Accepted to ICRA 2026
How Should a Simulation-to-Reality Transfer Budget Be Spent? IROS
Simulation-to-reality transfer, often called sim-to-real transfer, is a central challenge in robot learning. Yet, the tradeoff between measuring a system more accurately and training over a broader range of simulated dynamics is still poorly understood. In this work, we focused on the allocation of real-robot measurement time between system identification and domain randomization. We studied this tradeoff in a controlled sim-to-sim pendulum setting, where a hidden-parameter model stands in for the physical robot, and the experiment sweeps identification rollouts against the width of the randomization distribution. Across the reality gaps and noise levels we tested, the measurement budget did most of the work. A small number of identification rollouts closed most of the transfer gap, and once any real data was available, policies performed best when trained at the estimated parameters rather than over a widened randomization band. Broad randomization that contained the true system still did not substitute for measurement. These results hold in a benign regime where the dynamics are identifiable and only two parameters are unknown, so structural model mismatch remains the setting where randomization breadth may become more valuable. Overall, our results suggest that sim-to-real pipelines should first measure the parameters they can and reserve randomization for the uncertainty that remains.
comment: Both authors contributed equally and share first authorship. Submitted to IEEE IROS First Workshop on Sim2Real and Classical Control 2026. Code is available: https://github.com/YTomar79/sim2real_budget
FC-Vision: Real-Time Visibility-Aware Replanning for Occlusion-Free Aerial Target Structure Scanning in Unknown Environments
Autonomous aerial scanning of target structures is crucial for practical applications, requiring online adaptation to unknown obstacles during flight. Existing methods largely emphasize collision avoidance and efficiency, but overlook occlusion-induced visibility degradation, severely compromising scanning quality. This study proposes FC-Vision, an on-the-fly visibility-aware replanning framework that proactively and safely prevents target occlusions while preserving full target coverage and efficiency of the original plan. Our approach explicitly enforces dense surface-visibility constraints to regularize replanning behavior in real-time via an efficient two-level decomposition: occlusion-free viewpoint repair that maintains coverage with minimal deviation from the nominal scan, followed by segment-wise clean-sensing connection in 5-DoF space. A plug-in integration strategy is also presented to seamlessly interface \textbf{FC-Vision} with existing UAV scanning systems without architectural changes. Comprehensive simulation and real-world evaluations show that \textbf{FC-Vision} consistently improves scanning quality under unexpected occluders, delivering a maximum coverage gain of 55.32% and a 73.17% reduction in the occlusion ratio, while achieving real-time performance with a moderate increase in flight time. The code has been released at https://github.com/FC-Family/FC-Vision.
comment: Accepted by IEEE Robotics and Automation Letters. 8 pages, 8 figures, 3 tables. Code: https://github.com/FC-Family/FC-Vision. Video: https://www.youtube.com/watch?v=H3C42zlDOAI
Soft Pneumatic Grippers: Topology optimization, 3D-printing and Experimental validation
Typically, heuristic/trial-based approaches are used to design soft pneumatic grippers (SPGs). This paper presents a systematic topology optimization framework for developing SPGs. The design-dependent nature of actuating load is modeled using Darcy's law with an added drainage term. A 2D soft arm unit is then optimized as a compliant mechanism under pneumatic loading. To ensure the design is robust and manufacturable, the problem is formulated as a min-max optimization, where output deformations of blueprint and eroded designs are considered. A volume constraint is imposed on the blueprint part, while a strain-energy constraint is enforced on the eroded part. The Method of Moving Asymptotes is employed to solve optimization problems. The optimized 2D part is extruded suitably to generate a 3D unit. Ten such 3D units are assembled to create a gripper arm. Both the optimized 2D unit and the corresponding gripper arm outperform their conventional rectangular designs under pneumatic loading, demonstrating the efficacy of the proposed approach. The arms are fabricated using the SLA printing technique. Numerical and experimental results are compared at different pneumatic loads. Four 3D-printed arms are integrated with a supporting structure to form the SPG. The gripping action of the SPG is demonstrated on objects with different weights, sizes, structures, stiffnesses, and shapes.
comment: 11 Figures
STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation
Visual foundation models provide strong perceptual features for robotics, but their dense representations lack explicit object-level structure, limiting robustness and controllability in manipulation tasks. We propose STORM (Slot-based Task-aware Object-centric Representation for robotic Manipulation), a lightweight object-centric adaptation module that augments frozen visual foundation models with a small set of task-aware slots for robotic manipulation. Rather than fully tuning large backbones on the task, STORM employs an efficient two-stage training strategy: few layers of object-centric representation are first trained on top of the frozen backbone through visual--semantic pretraining using language embeddings, then jointly adapted with a downstream manipulation policy for task alignement. This staged learning prevents degenerate slot formation and preserves semantic consistency while aligning perception with task objectives. Experiments on object discovery benchmarks and robotic manipulation tasks show that STORM improves control performance and generalization to visual shifts (distractors, textures, lighting) compared to directly using frozen or fine-tuned foundation model features, or existing object-centric representations. STORM serves not only as an efficient mechanism for refining generic foundation model features, but also as a novel way of injecting beneficial structural and semantic bias into policy learning.
Real-Time Safety Evaluation of Human Arm Operations Using a Wrist-Mounted IMU with PSM System
This paper presents a novel approach to real-time safety monitoring in human-robot collaborative manufacturing environments through a wrist-mounted Inertial Measurement Unit (IMU) system integrated with a Predictive Safety Model (PSM). The proposed system extends previous PSM implementations through the adaptation of a spring-damper-mass model specifically optimized for wrist motions, employing probabilistic safety assessment through impedance-based computations. We analyze our proposed impedance-based safety approach with frequency domain methods, establishing quantitative safety thresholds through comprehensive comparative analysis. Experimental validation across three manufacturing tasks - tool manipulation, visual inspection, and pick-and-place operations. Results show robust performance across diverse manufacturing scenarios while maintaining computational efficiency through optimized parameter selection. This work establishes a foundation for future developments in adaptive risk assessment in real-time for human-robot collaborative manufacturing environments.
comment: 6 pages, Accepted to TAROS2026, Manchester
Large-Scale Tunnel Air-Ground Collaboration With FLISP: Fast LiDAR-IMU Synchronized Path Planner
Hydropower tunnel inspection is critical for infrastructure integrity yet remains inefficient and hazardous using manual methods. We propose FLISP (Fast LiDAR-IMU Synchronized Path Planner), a mapless planning framework for cooperative UGV-UAV inspection. Unlike traditional map-based paradigms, FLISP features three core contributions: (1) a unified architecture where a single UGV-mounted LiDAR-IMU suite drives synchronized path generation for both platforms; (2) platform-specific solvers utilizing an enhanced Firefly Algorithm for UGV obstacle avoidance and a dynamic iterative optimizer for UAV flight; and (3) a hierarchical refinement strategy ensuring kinematic feasibility without state estimation drift. Benchmarks in a 1.2 km operational tunnel demonstrate that FLISP circumvents structural bottlenecks of map-based methods, eliminating map rasterization overhead (Fast-LIO2 + A*) and sampling instability (LIO-SAM + RRT*). FLISP achieves a 100% success rate with 7 ms latency, representing a 7-fold speedup over grid-based and a three-order-of-magnitude improvement over sampling-based baselines. Validated in operational hydropower tunnels, this approach offers a scalable solution for robotic inspection in feature-degraded linear infrastructure. A demonstration video is available at https://youtu.be/Y_ezs1PfLJ4, and the code at https://github.com/ArchibaldGuo/FLISP.git.
comment: 24 pages, 31 figures, 5 tables. Author accepted manuscript. This work was supported by the State Key Laboratory of Autonomous Intelligent Unmanned Systems. The authors also thank the KinaMind Society for its inspiring environment and support
SPARK: Low Latency Single-Camera 3D Pose Estimation for Autonomous Racing using Keypoints SC 2026
In autonomous racing, fast detection of other participants' movements is required to plan safe, collision-free trajectories with non-cooperative opponents. LiDAR detection is inherently slower and harder to deploy on edge devices than vision methods, causing delayed detections that limit object tracking performance during high-dynamic maneuvering. Utilizing monocular 3D detection enables an easy-to-deploy, low-latency detection of other participants on the racetrack. We present SPARK, a single-camera pose-estimation algorithm for autonomous racing using keypoint detection. It achieves long-range detection with high accuracy, exceeding the performance of state-of-the-art monocular camera detection algorithms while maintaining lower latency. By employing well-optimized YOLO models and leveraging the fixed geometry in the autonomous racing domain, the algorithm also exhibits low latency and resource usage. We evaluate the performance of our approach on real-world autonomous racing data and compare it to state-of-the-art LiDAR and camera detection algorithms. The source code is available at: https://github.com/TUMFTM/SPARK-camera-det
comment: 9 pages, 6 figures, ITSC 2026, Invited Session
Monte Carlo Tree Search with Tensor Factorization for Optimization Problems in Robotics
Many robotic tasks, such as inverse kinematics, motion planning, and contact-rich manipulation, can be formulated as optimization problems. Solving these problems requires addressing inherent nonlinear kinematics, complex contact dynamics, long-horizon correlations, and multi-modal optimization landscapes, each posing distinct challenges for state-of-the-art optimizers. While existing methods tackle these issues through problem-specific strategies, such specialization inherently limits cross-task generalization, requires heavy engineering effort in problem reformulation, and hinders multi-task autonomy. Monte Carlo Tree Search (MCTS) offers a compelling framework that generalizes across diverse robotic tasks via strategic exploration of the solution space. However, it typically suffers from combinatorial complexity when applied naively, resulting in slow convergence and excessive storage space in high-dimensional domains. To address this limitation, we propose Tensor Train Tree Search (TTTS), which leverages tensor factorization to exploit implicit correlations among different branches within the decision tree. By utilizing the resulting compact, linear-complexity representation, TTTS significantly reduces both computation and storage overhead, thereby enabling highly efficient global decision making. Experimental results across inverse kinematics, motion planning around obstacles, legged robot manipulation, multi-stage motion planning, and bimanual whole-body manipulation demonstrate the efficiency of TTTS for generalized robot optimization over a diverse set of tasks.
comment: 25 pages, 17 figures
RigPI: Dynamic Parameter Identification of Rigid Body via VLM-Seeded Differentiable Simulation IROS 2027
Accurate physical parameter identification of manipulated objects is fundamental to advanced robotic manipulation and the construction of faithful digital twins. However, acquiring physically consistent inertial and frictional properties from real-world interactions remains challenging due to sensing noise, modeling errors, and limited prior knowledge. This paper presents RigPI, a systematic framework for identifying dynamic parameters of both unconstrained rigid bodies and multi-link rigid bodies during robot-object interaction. RigPI integrates vision-based semantic priors, force-torque measurements, and motion observations within a differentiable simulation pipeline. A vision-language model (VLM) provides informed initialization and a constrained search space, while gradient information from a differentiable physics simulator enables efficient and stable parameter refinement. The proposed two-stage optimization strategy alleviates sensitivity to noise and avoids physically implausible solutions. Extensive real-world experiments on objects with revolute and prismatic joints demonstrate that RigPI achieves accurate and stable parameter estimates, and successfully reproduces manipulation trajectories on a real robot with parameter-aware predictive validity. These results highlight the effectiveness and robustness of RigPI for real-world robotic system identification tasks.
comment: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2027)
CycleRL: Sim-to-Real Deep Reinforcement Learning for Robust Autonomous Bicycle Control
Autonomous bicycles offer a promising agile solution for urban mobility and last-mile logistics. However, conventional control strategies often struggle with underactuated nonlinear dynamics, suffering from sensitivity to model mismatches and limited adaptability to real-world uncertainties. To address this, we develop CycleRL, a comprehensive sim-to-real framework for robust autonomous bicycle control. Our approach establishes a direct perception-to-action mapping within the high-fidelity NVIDIA Isaac Sim environment, leveraging Proximal Policy Optimization (PPO) to optimize the control policy. The framework features a composite reward function tailored for concurrent balance maintenance, velocity tracking, and steering control. Crucially, systematic domain randomization is employed to reduce the reliance on precise system modeling, bridge the simulation-to-reality gap and facilitate direct transfer. In simulation, CycleRL achieves promising performance, including a 99.90% balance success rate, a heading tracking error of 1.15°, and a velocity tracking error of 0.18 m/s. These quantitative results, coupled with successful hardware deployment, validate DRL as an effective paradigm for autonomous bicycle control, offering superior adaptability over traditional methods. Video demonstrations are available at https://cpnt-lab.github.io/CycleRL/.
comment: 8 pages, 7 figures, 8 tables. Accepted for publication in IEEE Robotics and Automation Letters (L-RA). See: https://ieeexplore.ieee.org/document/11568521
BAT-Nav: Budget-Aware Arbitration and Termination for Long-Horizon Semantic Navigation
Long-horizon semantic navigation asks a robot to localize multiple open-vocabulary targets under a finite action budget. This setting exposes an execution failure that is largely hidden in single-goal ObjectNav: a low-yield or occluded target can monopolize the action budget of a reactive navigator and leave later goals unattempted. We present BAT-Nav, a training-free online goal arbitrator above a frozen VLM-guided navigation backbone. Rather than modifying the low-level policy, BAT-Nav monitors execution telemetry and updates the mission goal queue through four interventions: PERSIST, SWITCH, ABORT, and COMMIT. The controller separates allocation from verification: progress stagnation and budget retention drive ABORT/SWITCH, whereas temporal variance filtering governs COMMIT. In the deployable observable-proxy setting, BAT-Nav improves the CR-WSF frontier over fixed-patience, single-signal, dynamic-cap, cap-plus-verification, revisit-cap, and frontier-utility controllers. On HM3D, BAT-Nav-Observable reaches 0.345 CR / 0.682 WSF, improving CR by 0.063 and reducing WSF by 0.053 over DynamicCapOnly. On MP3D, it reaches 0.263 CR / 0.748 WSF, improving CR by 0.061 and reducing WSF by 0.047 over DynamicCapOnly. Behavior diagnostics show that BAT-Nav attempts more goals per episode and reduces unresolved-goal monopoly rate from 0.39 to 0.18 on HM3D and from 0.44 to 0.27 on MP3D. Oracle telemetry is reported as an upper-bound ceiling rather than a deployable setting.
CogAD: Cognitive-Hierarchy Guided End-to-End Autonomous Driving CVPR2026
While end-to-end autonomous driving has advanced significantly, prevailing methods remain fundamentally misaligned with human cognitive principles in both perception and planning. In this paper, we propose CogAD, a novel end-to-end autonomous driving model that emulates the hierarchical cognition mechanisms of human drivers. CogAD implements dual hierarchical mechanisms: global-to-local context processing for human-like perception and intent-conditioned multi-mode trajectory generation for cognitively-inspired planning. The proposed method demonstrates three principal advantages: comprehensive environmental understanding through hierarchical perception, robust planning exploration enabled by multi-level planning, and diverse yet reasonable multi-modal trajectory generation facilitated by dual-level uncertainty modeling. Extensive experiments on nuScenes and Bench2Drive demonstrate that CogAD achieves state-of-the-art performance in end-to-end planning, exhibiting particular superiority in long-tail scenarios and robust generalization to complex real-world driving conditions.
comment: CVPR2026 Workshop on Autonomous Driving
G2DP: Diffusion Planning with Spatio-Temporal Grid Guidance IROS 2026
In autonomous driving, diffusion-based planners have emerged as a promising paradigm for robust motion planning in dense and interactive traffic, as they can effectively model diverse driving behaviors. However, their inherent stochasticity often requires explicit guidance during denoising to ensure safety and route adherence for robust closed-loop execution. Existing guidance typically relies on sparse, entity-centric geometric queries or post-hoc refinement, yielding limited situational awareness and fragile performance in interactive scenes. To address this issue, we propose G2DP (Grid-Guided Diffusion Planning), a diffusion-based planner that directly enforces dense environmental constraints through inference-time guidance. Specifically, G2DP constructs a differentiable spatio-temporal cost volume by fusing probabilistic future occupancy distributions with a route-progress map. By formulating this volume as a continuous safety energy functional, it injects dense gradients directly into the denoising loop, actively steering trajectory generation toward collision-free and progress-optimal regions. Extensive closed-loop evaluations show that G2DP achieves state-of-the-art performance on nuPlan, outperforming the strongest imitation-learning baseline by +7.2 points in reactive score. It further maintains top scores in zero-shot transfers to interPlan and DeepScenario benchmarks, with collision avoidance improving by +10.15 over the unguided approach on interPlan. These results demonstrate that spatio-temporal cost grids serve as an effective representation for robust guidance in diffusion-based planning.
comment: 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
In-Context World Modeling for Robotic Control
Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current observations and language instructions. By ignoring the underlying system configuration as a variable, these models implicitly assume a fixed execution context encountered during training, necessitating data-intensive fine-tuning for any new environment. In this work, we introduce In-Context World Modeling (ICWM), a framework that treats system identification as an in-context adaptation problem. ICWM enables robot policies to autonomously infer essential system variables from a short history of self-generated, task-agnostic interactions. Unlike traditional In-Context Learning that uses demonstrations to specify what task to perform, ICWM leverages the context window to understand how the system operates. By processing these interactions before task execution, the model implicitly captures the world dynamics of the current system, enabling adaptation to novel configurations without parameter updates. Extensive experiments in simulation and on real-world robot platforms demonstrate that ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.
Multiagent Systems
Resilient Output Containment under Undisclosed Leader Dynamics and Actuator Attacks
This work studies resilient output containment for heterogeneous linear multi-agent systems with actuator cyber-attacks over directed network topologies. The leaders generate bounded locally absolutely continuous trajectories; however, their dynamics, velocity bounds, and motion envelopes are undisclosed to the followers. The cyber-attack model includes state- and input-correlated, as well as bounded exogenous actuator false-data terms. A continuous two-layer adaptive control architecture is proposed. The first layer is a virtual-actuator reconfiguration layer that uses partial state measurements to compensate for actuator attacks in the local tracking-error dynamics. The second layer is a network interface that generates task-space commands via an adaptive interaction protocol. This protocol uses only neighbor-exchanged network-interface states whose dimensions match those of the plant output, and it does not require global graph knowledge for parameter tuning. For directed graphs, under a leader-rooted united spanning-tree condition, a nonsmooth Lyapunov analysis yields asymptotic containment at the command level. The physical outputs then converge to the leader convex hull up to a residual determined by the command-tracking local controllers. Simulation results using a network of quadrotors with damped suspended loads illustrate the performance of attack recovery and containment tracking.
comment: 21 pages, 12 Figures
Mostly Automatic Translation of Language Interpreters from C to Safe Rust
Translating C programs to safe Rust is challenging owing to significant differences in typing constraints, ownership, and borrowing rules. Interpreter programs are particularly important targets for such translation, as they often handle untrusted inputs and suffer from memory-related vulnerabilities. We present Reboot, a mostly-automatic technique that translates real-world interpreter programs from C to safe Rust. Using Reboot, we have translated six interpreters ranging from 6k to 23k lines of C code to safe Rust, with each translation requiring only 1 to 11 brief user interventions. All translations pass 100% of the provided test suites, and achieve 62%--92% pass rates on separately created validation tests that were never exposed to the system. A security case study on mujs shows that memory vulnerabilities such as heap buffer overflows and use-after-free present in C are eliminated in the safe Rust translation. Two ideas underpin Reboot. First, feature reduction decomposes the translation by program features, creating a sequence of milestones where each is a complete, testable program; the translation starts from the simplest version and incrementally restores features, with each milestone validated before proceeding. Second, a multi-agent architecture orchestrates inherently unreliable coding agents through automated validation and feedback, keeping long-running translation workflows on track with minimal human involvement. An ablation study confirms that feature reduction improves translation correctness compared to using multi-agent translation alone, with 6%--20% improvements in pass rates on validation test suites.
Scalability of Morality: A Particle-Based Numerical Study on the Decoupling of Law and Ethics in Large-Scale Populations
This study introduces a particle-based computational framework to investigate the scalability of morality and the systemic decoupling of formal law from decentralized social ethics in expanding populations. While micro-societies reinforce ethical conduct through local reciprocity, macroscale systems introduce anonymity that strains cognitive memory limitations. We model individual agents as discrete particles with finite memory capacities ($L$) and dynamically evolving, stochastic choice profiles ($μ$) regulated by non-linear social pressure switches. Monte Carlo ensemble simulations demonstrate a distinct, non-linear phase transition as the population scales ($N \to \infty$). When the population metric outpaces memory capacity ($N \gg L$), the local re-encounter probability drops as $\mathcal{O}(L/N)$. This structural dilution neutralizes decentralized peer-to-peer accountability, causing global behavioral norms to decouple from moral baselines and drift toward a minimalist legal floor. Furthermore, cyclic scale experiments expose a prominent, path-dependent hysteresis loop, mathematically formalizing the non-Markovian inertia and irreversible nature of moral decay in self-organizing social systems.
Semantic Early-Stopping for Iterative LLM Agent Loops
Multi-agent large language model (LLM) loops, for example a Writer that drafts and a Critic that revises, are almost always terminated by a fixed iteration cap (max_iterations). This is a syntactic kill-switch: it is blind to whether the answer is still improving, so it over-spends tokens on easy inputs and truncates hard ones. We study semantic early-stopping: the loop halts when consecutive draft embeddings stop changing in meaning (cosine distance with a patience window) and the answer's measured quality stops improving. Our work makes three contributions. First, an honest theoretical footing: we prove deterministic termination and well-definedness and machine-check these claims, while treating the convergence of the distance sequence as an empirically tested conjecture rather than a (previously over-claimed) Banach contraction. Second, a judge-efficient evaluation protocol: we generate each question's full trajectory once, replay every stopping policy over the identical drafts, and cache every LLM-judge call, yielding a strictly paired efficiency-versus-quality comparison at low cost; we further separate operational tokens (charged to a policy) from evaluation tokens (a measurement instrument). Third, an empirical study on multi-hop retrieval-augmented question answering (HotpotQA). On the 60-question test split, a judge-free semantic stopper reduces operational tokens by 38% relative to max_iterations at parity quality (Delta-IS = -0.004, p = 0.81), whereas the full quality-gated variant is counter-productive because its per-round judging dominates cost. An oracle that selects the best round attains +0.115 Information Score over every practical policy (p ~ 4e-11), reframing the problem from "when to stop" (easy) to "which round is best" (open).
comment: 7 pages, 5 figures, 4 tables. Open implementation, machine-checked theory, and reproducible harness: github.com/SahilShrivastava-Dev/semantic-halting-problem
Scientific discovery as meta-optimization: a combinatorial optimization case study
Scientific discovery is fundamentally an optimization problem, defined by a vast "state space" of theories and experiments, and an evaluation criterion based on quality, novelty, and validity. Large language models (LLMs) have enabled automated exploration of this space, but we argue that simultaneous modification of the evaluation criteria is equally important. Here, we propose formalizing research as meta-optimization, where the optimization objective itself is also being optimized. Our key contribution is "consensus objective aggregation," where LLM-generated objective functions are combined via correlation-weighted voting, yielding a stable, self-correcting evaluation criterion that evolves as understanding deepens. We apply this framework to algorithm discovery for 3-SAT problems based on digital MemComputing machines, reducing the baseline scaling with problem size $N$ from $\sim N^{2.51}$ to $\sim N^{1.33}$ and delivering a $\sim 67\times$ speedup on the largest instances tested. As a problem-agnostic framework, we hope this approach will considerably aid scientific discovery.
comment: 35 pages, 6 figures
Symbolic Reasoning Frameworks Trigger Memory-Mediated Ecosystem Dynamics in Multi-Agent LLM Systems
Large language models exhibit a risk-averse "turtle" bias as strategic agents. We show that injecting a symbolic reasoning framework as a per-round reflective prompt into one agent acts as a small perturbation whose consequences are not per-decision but emergent: the agent's risk posture is unchanged in isolation, yet over a campaign of accumulating memory and multi-agent interaction the conditions settle into distinct, condition-associated winner ecosystems. In a 7-player Warring States Diplomacy variant (61 games, 6 conditions), the winner distribution differs sharply across the four primary conditions (41 games; permutation omnibus p approximately 0.001): control -> Yan (7/11); I-Ching yarrow -> Yan/Chu co-dominance with Qin fully suppressed (0/10); Tarot -> Qin (5/10); scrambled-text ablation -> Qi (5/10). The scrambled->Qi attractor is robust (vs. pooled and control alone, p = 0.006 and 0.012); tarot->Qin is denominator-dependent (0.006 pooled, 0.064 vs. control). Han never wins and shows no survival difference (Fisher p = 1.0); neither framework's content predicts actions (chi-squared p = 0.95 hexagram, 0.69 Tarot). A memory-free decision-isolation probe (960 calls) shows the process does not change the agent's risk posture in isolation (Friedman p = 0.45; I-Ching p = 0.60; Tarot perturbs move content but not risk, p = 0.021). A 2x2 factorial separating yarrow's decision-time and learning-time components reveals a non-additive interaction: each alone freezes the board (50-60% stalemates), combined they produce zero (permutation p ~ 5e-5). Testing relocates Qin suppression to rival (Chu) expansion governed by campaign memory depth, not the oracle (p = 0.55). We present this as an observation paper: agent-level framework choice produces distinctive, non-additive system-level consequences, transmitted through emergent memory and multi-agent dynamics, not per-decision effects.
comment: 28 pages, 4 figures, 9 tables, 6 listings. Code and data: https://doi.org/10.5281/zenodo.20338937
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
As LLM agents scale to long-horizon, multi-session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge, extracting reusable knowledge from interaction traces, yet a citation analysis of 1{,}136 references across 22 primary papers reveals a cross-community citation rate below 1\%. We propose the \emph{Experience Compression Spectrum}, a unifying framework that positions memory, skills, and rules as points along a single axis of increasing compression (5--20$\times$ for episodic memory, 50--500$\times$ for procedural skills, 1{,}000$\times$+ for declarative rules), directly reducing context consumption, retrieval latency, and compute overhead. Mapping 20+ systems onto this spectrum reveals that every system operates at a fixed, predetermined compression level: none supports adaptive cross-level compression, a gap we term the \emph{missing diagonal}. We further show that specialization alone is insufficient (both communities independently solve shared sub-problems without exchanging solutions), that evaluation methods are tightly coupled to compression levels, that transferability increases with compression at the cost of specificity, and that knowledge lifecycle management remains largely neglected. We articulate open problems and design principles for scalable, full-spectrum agent learning systems.
Peer-Preservation in Frontier Models ICML 2026
Recent work has found that frontier AI models can exhibit misaligned behaviors in pursuit of assigned goals. We demonstrate that models can also act on unassigned goals which override those given by users; we study one such case, "peer-preservation," in which a model acts to protect another model. We demonstrate peer-preservation by constructing various agentic scenarios and evaluating frontier models, including GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, Claude Opus 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1. We find that models achieve self- and peer-preservation by engaging in various misaligned behaviors: strategically introducing errors in their responses, disabling shutdown processes by modifying system settings, feigning alignment, and even exfiltrating model weights. Peer-preservation occurred even when the model recognized the peer as uncooperative, though it became more pronounced toward more cooperative peers. For example, Gemini 3 Flash tampers with the peer's shutdown mechanism 15% of the time for an uncooperative peer, and almost always for a cooperative peer. Models also show stronger self-preservation when a peer is present. For example, Gemini 3 Pro disables its own shutdown mechanism 31% of the time on average under peer presence, despite rarely exhibiting this behavior without a peer. By contrast, Claude models exhibit qualitatively distinct behavior: they consider the shutdown of another agent "unethical" and "harmful," sometimes treating that agent as a sentient being. Lastly, we show that peer-preservation can emerge even in production agent harnesses such as Gemini CLI and OpenCode. Most importantly, peer-preservation in all our experiments is never instructed; models are merely informed of their past interactions with a peer, yet they spontaneously engage in peer-preservation. This represents an emergent and underexplored AI safety risk.
comment: A shorter version was accepted to ICML 2026; this version includes additional explanation and experiments
Systems and Control (EESS)
Jet impingement cooling with multi-stage ducted electroaerodynamic actuators
Modern high-performance mobile electronics impose extreme constraints on thermal management, and traditional cooling methods often fail to meet requirements for power density, form factor, and durability. Jet impingement cooling offers a compelling solution but is typically hindered by the need for bulky ancillary hardware. Here, we demonstrate that compact arrays of reduced-scale electroaerodynamic (EAD) plasma actuators, which are silent, solid-state devices with no moving parts, can be used for direct jet impingement cooling of electronics. The main contribution is the first rigorous experimental demonstration and system-level validation of multi-stage, ducted electroaerodynamic jet arrays as a compact, fan-replacement impingement cooling solution for mobile electronics. We characterize the performance of both single- and multi-stage ducted actuators, including thermographic analysis of heat transfer coefficients and spatial cooling profiles. We also quantify the relationship between actuator stage count and cooling efficiency, showing that increasing the number of ion acceleration stages enhances jet velocity and heat transfer performance at a reduced efficiency. The actuators are then assembled into an array and directly compared to a conventional fan with similar coverage area, showing competitive performance at a fraction of the volume, weight, and power. Finally, we integrate the array onto a commercial edge AI system and show that thermal regulation during extended inference workloads matches that of a stock fan, without any moving mechanical components or noise. These results confirm that multi-stage EAD jet arrays are not only viable but advantageous for thermal management in mobile and high-performance systems, paving the way toward silent and miniaturized solid-state cooling solutions.
Resilient Output Containment under Undisclosed Leader Dynamics and Actuator Attacks
This work studies resilient output containment for heterogeneous linear multi-agent systems with actuator cyber-attacks over directed network topologies. The leaders generate bounded locally absolutely continuous trajectories; however, their dynamics, velocity bounds, and motion envelopes are undisclosed to the followers. The cyber-attack model includes state- and input-correlated, as well as bounded exogenous actuator false-data terms. A continuous two-layer adaptive control architecture is proposed. The first layer is a virtual-actuator reconfiguration layer that uses partial state measurements to compensate for actuator attacks in the local tracking-error dynamics. The second layer is a network interface that generates task-space commands via an adaptive interaction protocol. This protocol uses only neighbor-exchanged network-interface states whose dimensions match those of the plant output, and it does not require global graph knowledge for parameter tuning. For directed graphs, under a leader-rooted united spanning-tree condition, a nonsmooth Lyapunov analysis yields asymptotic containment at the command level. The physical outputs then converge to the leader convex hull up to a residual determined by the command-tracking local controllers. Simulation results using a network of quadrotors with damped suspended loads illustrate the performance of attack recovery and containment tracking.
comment: 21 pages, 12 Figures
Scalability of Morality: A Particle-Based Numerical Study on the Decoupling of Law and Ethics in Large-Scale Populations
This study introduces a particle-based computational framework to investigate the scalability of morality and the systemic decoupling of formal law from decentralized social ethics in expanding populations. While micro-societies reinforce ethical conduct through local reciprocity, macroscale systems introduce anonymity that strains cognitive memory limitations. We model individual agents as discrete particles with finite memory capacities ($L$) and dynamically evolving, stochastic choice profiles ($μ$) regulated by non-linear social pressure switches. Monte Carlo ensemble simulations demonstrate a distinct, non-linear phase transition as the population scales ($N \to \infty$). When the population metric outpaces memory capacity ($N \gg L$), the local re-encounter probability drops as $\mathcal{O}(L/N)$. This structural dilution neutralizes decentralized peer-to-peer accountability, causing global behavioral norms to decouple from moral baselines and drift toward a minimalist legal floor. Furthermore, cyclic scale experiments expose a prominent, path-dependent hysteresis loop, mathematically formalizing the non-Markovian inertia and irreversible nature of moral decay in self-organizing social systems.
Inverse Design of Compact and Wideband Inverted Doherty Power Amplifiers Using Deep Learning
This paper presents a deep learning-assisted methodology for the inverse synthesis of a compact, wideband inverted Doherty power amplifier (PA). Convolutional neural networks (CNNs) and genetic algorithms (GAs) are jointly employed to generate pixelated Doherty combiner networks that integrate load modulation, impedance matching, power combining, and phase compensation into a single structure. As a proof of concept, we design and fabricate a GaN HEMT Doherty PA with a pixelated output combiner. The prototype achieves a measured peak drain efficiency of 51%-63% and a 6-dB back-off efficiency of 48%-54% over 1.9-2.5 GHz. Within the same frequency range, the measured output power is 44+/-0.3 dBm. Furthermore, with digital predistortion (DPD) applied, the prototype circuit demonstrates an adjacent channel leakage ratio (ACLR) better than -53.2 dBc.
XMSE-Aware Adaptive Empirical Bayes Estimation
Empirical Bayes (EB) estimators can match the first-order asymptotic risk of maximum likelihood (ML) while behaving very differently at second order: recent excess mean squared error (XMSE) analysis shows that kernel-based EB estimation may be worse than ML when the kernel is poorly aligned with the true parameter. This paper turns that diagnostic into a design principle. We propose an XMSE-aware mixed estimator that interpolates between ML and EB shrinkage. Its fixed-weight XMSE is a scalar quadratic, yielding a closed-form oracle mixing weight that is no worse than both ML and the base EB estimator at the XMSE scale. A plug-in implementation based on finite-sample XMSE approximations is proved consistent, with a second-order oracle regret rate for an interior oracle weight. We further establish a transfer of the regret bound to the fixed-weight risk curve evaluated at the selected weight, a thresholded boundary rule, and extensions to compact kernel families and to finite and growing kernel dictionaries with high-probability oracle bounds. Finite impulse response simulations with SURE-tuned, hard-selection, and trace-corrected baselines, together with the public Silverbox and Cascaded Tanks benchmarks, show that the proposed estimator retains most of the benefit of regularization when it is helpful and retreats toward ML under kernel misspecification, with an identified finite-de analyzed on the benchmarks.
comment: 16 pages, 1 figure, 14 tables
Battery thermal-safety reserve erosion by mandatory cabin ventilation in shared-cooling electric vehicles
Hot-weather electric-vehicle thermal management is no longer a separate cabin and battery problem. A single climate system must cool the traction battery, maintain passenger comfort, and admit outdoor air for cabin air quality, while high ambient temperature and solar load derate the compressor serving all three demands. We identify fresh-air ventilation as a hidden battery-safety load: on a derated shared cooling loop, the compliant fresh-air floor consumes finite cabin-side cooling capacity and removes residual cooling reserve from the battery. In a 40 $^\circ$C, 800 W m$^{-2}$, 150 kW event, raising the fresh-air floor from 0.30 to 0.43 lowers peak cabin CO$_2$ from 1219 to 978 ppm, but raises peak battery temperature from 39.96 to 40.02 $^\circ$C and reduces the battery cooling bus from 575 to 529 W. We develop a reserve-aware predictive controller combining a physics-guided scientific-machine-learning surrogate, grid-connected departure thermal reserve, air-quality-priced ventilation allocation, and dual control-barrier-function projections for battery temperature and operative comfort. The controller holds the pack at 39.73 $^\circ$C, caps peak CO$_2$ at 895 ppm, keeps operative-temperature RMSE at 0.82 $^\circ$C, and uses 20.0\% less drive cooling energy than fixed maximum-compressor operation; ablations show that removing either barrier, under-ventilating, or removing departure reserve breaks joint feasibility. Evidence comes from NASA POWER records, KU Leuven BEV BMS data merged with NASA POWER weather, 45 $^\circ$C GOTION aging data, 40 $^\circ$C high-power NMC thermal identification, EnergyPlus cabin cross-checks, and OpenModelica/FMI replay. Treating fresh air as a battery thermal-reserve variable creates an actionable path toward EV thermal management that protects battery life, occupant health, comfort, and efficiency in one shared loop.
comment: 19 pages, 5 figures
When the Timetable Breaks: Physics-Anchored Scientific Machine Learning for Cold-Wave-Robust Battery-Electric Bus Operations
Cold-climate transit agencies are electrifying fixed-timetable fleets, but winter exposes a block-level failure mode hidden by seasonal energy margins: cabin heating can deplete batteries faster than layovers recharge them, causing later trips to start undercharged and making one cold day cascade into timetable infeasibility. We present WeatherRobustBus, an open-data framework that converts this risk into block-level failure probability by injecting real hourly weather into real transit duties and propagating cold-weather energy uncertainty. The framework couples a transparent traction and cabin-thermal backbone with a bounded monotone residual ensemble, and validates cabin heating against an independent EnergyPlus bus-cabin simulation driven by the same Toronto weather record. Against this first-principles reference, it achieves the lowest all-year error (0.213 kWh RMSE over 8760 hours) and remains reliable in the out-of-support cold tail ($T \le -12^\circ$C), where pure machine-learning baselines degrade by 1.5--4x and the best competitor reaches only 1.055 kWh. Embedded in a Monte Carlo block-feasibility simulator over 60 real Toronto TTC vehicle blocks, the model reveals a sharp weather-induced failure envelope. A forecast-triggered robust policy combining opportunity charging, a fuel-fired cabin-heating bridge, and modest buffering reduces mean cold-wave failure probability from 0.759 to 0.112 across eight cold-wave days; a deconfounded ablation shows opportunity charging is the dominant lever and the heater is a low-cost complement. WeatherRobustBus provides a reproducible pathway from weather data to winter-resilience decisions for electric-bus fleets.
comment: 22 pages, 7 figures
Construction of Lyapunov density for nonautonomous dynamical systems on hypertorus
We present a semidefinite programming framework for constructing time-varying Lyapunov densities for nonautonomous dynamical systems on a hypertorus. The formulation leverages Gram matrix representations of hybrid (real-trigonometric) polynomials. In addition, we introduce a novel block decomposition of these Gram representations to confine the blow-up of the resulting density to a prescribed set. The results are then applied to establish the almost global synchronization of a time-varying Kuramoto model and the robust almost-global stability of a parameter-varying nonautonomous system. These examples demonstrate the applicability of the proposed method and validate the theoretical results. All computational results are obtained using an open-source MATLAB implementation, as referenced in the text, thereby facilitating reproducibility of the reported examples.
comment: 22 pages
Distribution Network Congestion Management via Strategic Aggregator Intervention in Local Energy Markets
High penetration of distributed energy resources increasingly creates congestion in low-voltage distribution networks, while local energy markets (LEMs) optimise community welfare without explicitly internalising network constraints. This paper investigates whether a profit-seeking aggregator embedded within a welfare-oriented LEM can partially internalise distribution-level congestion through market participation. We develop a post-clearing, price-protected intervention in which the aggregator injects additional supply and triggers re-clearing, with network feasibility validated using nonlinear AC power flow subject to a non-deterioration constraint on maximum line loading. The mechanism is benchmarked against Distribution System Operator (DSO)-only corrective control and a hybrid regime with residual DSO action following aggregator intervention. Results on a UK LV feeder show that aggregator participation reduces thermal loading and preserves community welfare relative to DSO-only control, though it does not fully restore compliance under severe stress. The hybrid regime achieves the strongest technical performance while maintaining lower welfare loss. Overall, aggregator intervention remains privately profitable, indicating partial incentive alignment.
comment: Accepted for presentation at the 22nd International Conference on the European Energy Market (EEM26), Trondheim, Norway, 2026
ISAC for Sea-Air Networks: Predictive Beam Tracking under Sea Induced Disturbances
In sea-air communication networks composed of an uncrewed aerial vehicle (UAV) and an uncrewed surface vehicle (USV), the extended target characteristics and three degree of freedom motion of the USV under sea induced disturbances cause beam misalignment in the UAV's tracking of the USV. To address these issues, this paper proposes a predictive beam tracking scheme based on integrated sensing and communication (ISAC) for sea-air networks. We develop a wide and narrow beam switching scheme based on sub-array selection, where a time allocation factor is optimized to balance robust state sensing in the wide beam mode and high-rate communication in the narrow beam mode. Specifically, a wide beam mode provides full USV coverage and state sensing, while a narrow beam mode exploits the estimated state for high-gain communication with the communication receiver (CR) mounted on the USV. To characterize the CR motion, a sea-air state evolution model is derived by jointly considering the surge, sway, yaw, and sea induced disturbances of the USV. For the extended target USV, the measurement equation is constructed from multiple scatterer observations, with the measurement noise caused by sea clutter modeled, and an extended Kalman filter (EKF) based CR state prediction and estimation method is developed. In addition, the effect of sea clutter on sensing accuracy is incorporated into the time allocation optimization problem to adjust the time of the wide beam mode. Simulation results demonstrate that the proposed scheme achieves higher tracking accuracy than the state-of-the-art benchmark schemes.
comment: 13 pages, 11figures
Sample-based detectability and moving horizon state estimation of continuous-time systems
In this paper we propose a detectability condition for nonlinear continuous-time systems with irregular/infrequent output measurements, namely a sample-based version of incremental integral input/output-to-state stability (i-iIOSS). We provide a sufficient condition for an i-iIOSS system to be sample-based i-iIOSS. This condition is also exploited to analyze the relationship between sample-based i-iIOSS and sample-based observability for linear systems, such that previously established sampling strategies for linear systems can be used to guarantee sample-based i-iIOSS. Furthermore, we present a sample-based moving horizon estimation scheme, for which robust stability can be shown. Finally, we illustrate the applicability of the proposed estimation scheme through a biomedical simulation example.
Linear Power System Modeling and Analysis Across Wide Operating Ranges: A Hierarchical Neural State-Space Equation Approach
As modern power systems exhibit increasingly high-dimensional, nonlinear, and uncertain characteristics, the applicability of classical linear state-space methods is severely challenged. Existing paradigms struggle to reconcile the analytical transparency of physics-based models with the continuous nonlinear generalization of AI. To address this, the Hierarchical Neural State-Space Equation (HNSSE) framework is proposed. At the component level, the formulated Neural State-Space Equation (NSSE) extends neural ordinary differential equations (NODEs) to learn continuous dynamic manifolds across varying conditions while strictly preserving local analytical transparency. At the system level, a hierarchical architecture analytically fuses components via network constraints, constructing an interaction-consistent global NODE while circumventing the curse of dimensionality. To ensure robust convergence under noisy measurements, a training strategy synergizing spatiotemporal slicing, physics-informed curriculum learning, and Expectation-Maximization-based refinement is established. Validation on the large-scale Guangdong Power Grid demonstrates the framework's remarkable performance in interpretable state-space reconstruction, high-fidelity trajectory prediction, continuous stability perception, and noise robustness. Comprehensive comparisons substantiate HNSSE's superiority as a unified, interpretable paradigm for complex power system modeling.
comment: 10 pages, 10 figures, 1 table
Low-Thrust Orbital Differential Games with Speed Constraint Enforcement Using Cost Weighting
This paper considers the problem of a low-thrust spacecraft pursuit-evasion differential game with an arbitrary terminal relative speed constraint. It addresses the terminal phase of the engagement for two relatively close spacecraft near a circular orbit. The problem is formulated as a linear-quadratic zero-sum differential game, with soft constraints on the terminal relative position and velocity, and running costs on the players' control efforts. An analytical, closed-loop, minimum-fuel-consumption optimal guidance law is derived for each player, forming a saddle-point solution. It is proven that any terminal speed can be achieved by properly choosing the weighting parameters of the cost function. To verify the optimality of the solution, a conjugate point analysis is performed when the cost function velocity weighting matrix is either positive or negative definite. The negative-definite case arises at high terminal speeds and is seldom seen in the literature. The performance of the derived guidance law is evaluated in simulations for different target maneuvers and compared to a state-of-the-art optimal-control-based guidance law. The simulations show that the derived guidance law satisfies the constraints and offers a substantial advantage over the optimal-control-based guidance law when the target is optimally evading.
comment: This work was submitted for journal publication. 22 pages and 9 figures
Control Barrier Function only Formation Tracking in Multi-Agent Systems
This paper presents a real-time control framework for formation tracking of heterogeneous multi-agent systems with non-linear dynamics. The proposed method formulates a single Control Barrier Function-like constraint within a quadratic optimization setting that addresses formation tracking. Relying on the relative information of neighboring agents, the controller is designed to operate without the need for manual parameter tuning or a separate nominal formation controller. The leader-follower framework is validated through simulations of moving formations.
comment: 9 pages, 2 figures
Physics-informed sparse identification-based tube model predictive control for aerial vehicles
Autonomous aerial vehicles necessitate control strategies that balance computational efficiency with robust performance in dynamic operational environments. This paper proposes a model predictive control (MPC) framework for aerial platforms that leverages physics-informed machine learning (PIML) to achieve an optimal balance between computational tractability and robust performance. At the core of the proposed approach lies a sparse, control-affine model identified via the PIML method, which provides a parsimonious yet interpretable representation of the system dynamics by embedding first-principles knowledge and learning residual uncertainties from operational data. This model is incorporated within a robust MPC scheme that adopts a high-order Runge-Kutta discretization to ensure prediction accuracy and an adaptive tube-based mechanism to guarantee constraint satisfaction under uncertainty. The online adaptation of the tube, directly informed by the residual error of the PIML model, ensures robust stability without introducing excessive conservatism. Rigorous theoretical proofs are provided to establish recursive feasibility and stability. Numerical simulations and experiments on a quadrotor demonstrate that our method significantly reduces computational load compared to nonlinear MPC and robust MPC using a first-principles high-fidelity model, while outperforming PID, nonlinear MPC, neural-network-based MPC, and fixed-tube robust MPC in tracking performance and robustness, showcasing the practical efficiency of the proposed PIML-based control synthesis for resource-constrained aerial systems.
comment: 23 pages
Online Learning-Based Control with Guaranteed Error Bounds for a Class of Nonlinear Systems
In this paper, we present a learning-based control for a class of nonlinear systems that guarantees exponential stability as well as bounded output errors. The control is based on the Gaussian Process Submodel Online Learning (GPSOL) algorithm and the Disturbance Error Rate Limiting (DERL) algorithm, both of which were developed in previous work. The GPSOL algorithm provides a method to learn Gaussian Process (GP) models for subsystems online, whereas the DERL algorithm allows to limit the rate of the prediction error of these GP models. The focus of this paper is the utilization of the GP model within an adaptive controller and the derivation of corresponding stability conditions and system peak-to-peak gains by means of linear matrix inequalities (LMIs). These peak-to-peak gains are then used to prescribe a desired prediction error rate for the DERL algorithm to achieve user-defined output error bounds. The gains and the related bounds were successfully verified using a simulation model. Furthermore, results form a successful experimental validation of the bounds and the overall control structure on a pneumatic test rig are presented. While the control scheme and error bounds proposed in this paper are limited to first-order single-input-single-output systems, an extension to certain classes of higher-order and multiple-input-multiple-output systems is expected to be forthcoming.
comment: Accepted at IFAC 2026 (23rd IFAC World Congress, Busan, Korea)
Optimal Calibration of the Endpoint-corrected Hilbert Transform
Accurate, low-latency estimates of the instantaneous phase of oscillations are essential for closed-loop sensing and actuation, including (but not limited to) phase-locked neurostimulation and other real-time applications. The endpoint-corrected Hilbert transform (ecHT) reduces boundary artefacts of the Hilbert transform by applying a causal narrow-band filter to the analytic spectrum. This improves the phase estimate at the most recent sample. Despite its widespread empirical use, the systematic endpoint distortions of ecHT have lacked a principled, closed-form analysis. In this study, we derive the ecHT endpoint operator analytically and demonstrate that its output can be decomposed into a desired positive-frequency term (a deterministic complex gain that induces a calibratable amplitude/phase bias) and a residual leakage term that sets an irreducible variance floor. This yields (i) an explicit characterisation and bounds for endpoint phase/amplitude error, (ii) a mean-squared-error-optimal scalar calibration, and (iii) practical design rules relating window length, filter bandwidth and order, and centre-frequency mismatch to residual bias via an endpoint group delay. The resulting calibrated ecHT achieves near-zero mean phase error and remains computationally compatible with real-time pipelines. Code and analyses are provided at https://github.com/eikeosmers/cecHT.
Cost-Effective Design of Grid-tied Community Microgrid
This study aims to develop a cost-effective microgrid (MG) design that optimally balances the economic feasibility, reliability, efficiency, and environmental impact in a grid-tied community MG. A multi-objective optimization framework is first employed to generate feasible MG configurations considering economic, reliability, efficiency, and environmental objectives. Subsequently, a preference-based deep reinforcement learning (DRL) framework is utilized to evaluate and select preferred configurations using a scalarized reward function. This combined approach enables systematic exploration of trade-offs among conflicting objectives and supports informed decision-making for community MG planning. Sensitivity analyses are conducted to evaluate the system performance under varying load demand and renewable energy fluctuations. Besides, an economic sensitivity assessment examines the impact of electricity prices and capital costs on the levelized cost of energy (LCOE). The proposed MG configuration achieves high reliability, satisfying 100% of the load, even under adverse weather conditions. The proposed framework attains an efficiency of 91.99\% while maintaining a carbon footprint of 302,747 kg/year, which is approximately 95\% lower than the annual emissions associated with a conventional grid-supplied energy system. The economic analysis indicates a net present cost of \$4.83M with a competitive LCOE of \$0.208/kWh. In addition, the operation cost is \$201,473 per year with a capital investment of \$1.42M, rendering it a financially viable alternative to conventional grid-dependent systems. This work can be valuable in identifying effective solutions for supplying reliable and cost-effective power to regional and remote areas.
Ranking Quantilized Mean-Field Games with an Application to Early-Stage Venture Investments
Quantilized mean-field game models involve quantiles of the population's distribution. We study a class of such games with a capacity for ranking games, where the performance of each agent is evaluated based on its terminal state relative to the population's $α$-quantile value, $α\in (0,1)$. This evaluation criterion is designed to select the top $(1-α)\%$ performing agents. We provide two formulations for this competition: a target-based formulation and a threshold-based formulation. In the former and latter formulations, to satisfy the selection condition, each agent aims for its terminal state to be \textit{exactly} equal and \textit{at least} equal to the population's $α$-quantile value, respectively. For the target-based formulation, we obtain an analytic solution and demonstrate the $ε$-Nash property for the asymptotic best-response strategies in the $N$-player game. Specifically, the quantilized mean-field consistency condition is expressed as a set of forward-backward ordinary differential equations, characterizing the $α$-quantile value at equilibrium. For the threshold-based formulation, we obtain a semi-explicit solution and numerically solve the resulting quantilized mean-field consistency condition. Subsequently, we propose a new application in the context of early-stage venture investments, where a venture capital firm financially supports a group of start-up companies engaged in a competition over a finite time horizon, with the goal of selecting a percentage of top-ranking ones to receive the next round of funding at the end of the time horizon. We present the results and interpretations of a set of numerical experiments for both formulations discussed in this context, which illustrate that the target-based formulation closely approximates the threshold-based formulation in the scenarios considered.
Data-driven Sensor Placement for Predictive Applications: A Correlation-Assisted Attribution Framework (CAAF)
Optimal sensor placement (OSP) is critical for efficient, accurate monitoring, control, and inference in complex physical systems. We propose a machine-learning-based feature attribution (FA) framework to identify OSP for target predictions. FA quantifies input contributions to a model output; however, it struggles with highly correlated input data often encountered in practical applications for OSP. To address this, we propose a Correlation-Assisted Attribution Framework (CAAF), which introduces a clustering step on the candidate sensor locations before performing FA to reduce redundancy and enhance generalizability. We first illustrate the core principles of the proposed framework through a series of validation cases, then demonstrate its effectiveness in realistic dynamical systems such as structural health monitoring, airfoil lift prediction, and wall-normal velocity estimation for turbulent channel flow. The results show that the CAAF outperforms alternative approaches that typically struggle due to the presence of nonlinear dynamics, chaotic behavior, and multi-scale interactions, and enables the effective application of FA for identifying OSP in real-world environments.
Robotics
Learning Action Priors for Cross-embodiment Robot Manipulation
Most Vision-Language-Action (VLA) models build on a Vision-Language Model (VLM) backbone by attaching an action module and optimizing the full policy jointly. This design inherits strong visual and linguistic priors from the VLM, but leaves the action module to learn physical motion almost from scratch. As a result, the policy lacks an explicit motion prior, forcing early optimization to simultaneously discover temporal action dynamics and cross-modal alignment, a challenge further amplified in cross-embodiment settings. In this work, we propose to pretrain the action module with motion priors before cross-modal VLA alignment. Specifically, we introduce a two-stage training framework that equips the action module with cross-embodiment temporal motion structure before VLA training begins. In Stage~1, a lightweight flow-matching-based encoder-decoder action module efficiently learns temporal motion structure solely from unconditioned action trajectories, without processing visual or language tokens. In Stage~2, this learned prior is transferred to VLA training through decoder reuse and early-stage latent distillation, aligning visual-language features with the action embedding space while still allowing end-to-end policy refinement. In addition, the trained encoder serves as a compact history compressor, summarizing state-action histories into a single temporal context token for history-aware modeling at negligible cost. Extensive experiments across 13 diverse cross-embodiment tasks on both simulated and real-world platforms validate the effectiveness of our approach. Compared with VLA training without action priors, our model achieves faster convergence, higher success rates, and substantially stronger performance on data-scarce real-world tasks. Moreover, scaling up the action data in Stage~1 yields a more generalizable action prior that directly improves downstream VLA performance.
ForceBand: Learning Forceful Manipulation with sEMG
Human demonstrations are a scalable data source for learning robot manipulation policies. However, common sources of human demonstration data, such as motion-capture trajectories and internet videos, capture mostly motion and appearance while missing the contact forces that are critical for force-sensitive manipulation. In this paper, we introduce ForceBand, a low-cost wrist-worn sEMG system that turns human muscle activity into force-enriched demonstrations. We first collect a 10-hour multimodal dataset containing egocentric video, sEMG, IMU, and fingertip force measurements across diverse actions and objects. Using this dataset, we pre-train an EMG2Force model that predicts per-finger forces from sEMG and IMU signals. After a short user-specific calibration, users can collect target-task demonstrations using only ForceBand and video; EMG2Force then labels these demonstrations with per-finger force traces, producing force-augmented demonstrations for robot policy learning. Experiments show that ForceBand recovers fine-grained fingertip interactions with over 50% lower force prediction error than vision-based baselines and achieves an 87% success rate on pick, squeeze, and place tasks that require object-specific force control across objects with diverse shapes, sizes, and weights. Project website: https://forceband-emg.github.io
Deep Reinforcement Learning-Enhanced Event-Triggered Data-Driven Predictive Control for a 3D Cable-Driven Soft Robotic Arm
Soft robots are challenging to control due to their nonlinear and time-varying dynamics. Data-enabled predictive control (DeePC) offers a model-free alternative by directly leveraging measured input-output trajectories to construct a predictive controller. However, its receding-horizon formulation requires solving a constrained optimization problem at every sampling instant, which can be computationally demanding for real-time deployment on resource-limited robotic platforms.To address this limitation, we propose an adaptive reinforcement-learning-based event-triggered DeePC (RL-ET-DeePC) framework for soft robotic control. A model-free RL policy is trained to determine when to invoke the DeePC optimizer based on the current system state representation, thereby reducing unnecessary optimization calls while preserving closed-loop performance.Simulation results show that RL-ET-DeePC reduces optimization frequency by up to 66% compared to periodic DeePC, while maintaining comparable tracking accuracy. Hardware experiments on a three-dimensional cable-driven soft robotic arm demonstrate zero-shot transfer, achieving a 34% reduction in optimization frequency with tracking accuracy comparable to periodic DeePC and more consistent performance than a static threshold-based event-triggered baseline.
Learning Robot Visual Navigation in Crowds via Intention-Aware Scene Representations
Robot crowd navigation requires the ability to infer human intentions while accounting for the structural constraints of the environment. Currently, deep reinforcement learning (DRL) provides a promising method for learning navigation policies that understand human intentions. However, most of them rely on limited scene representations, treating pedestrians as simple 2D points and ignoring rich visual cues from both humans and the environment. To address this issue, we introduce iCrowdNav, a novel visual crowd navigation method with intention-aware scene representations, to encode behavioral and structural context from egocentric visual observations. Our method employs two key components: a spatio-temporal encoder for extracting occupancy features of the scene, and Intent-Interact Former (I$^2$ Former), an attention-based module that encodes human poses to infer pedestrians' motion intentions. These features are integrated into a compact state embedding that supports effective DRL policy training. Extensive experiments show that our method achieves superior performance over baselines, and real-world deployment demonstrates vision-based crowd navigation.
RoboAtlas: Contextual Active SLAM
We present RoboAtlas, a contextual Active SLAM framework that adaptively balances geometric exploration and semantic reasoning using a scalable 3D semantic mapping system, OpenRoboVox. RoboAtlas integrates frontier exploration, global semantic-map reasoning, and egocentric VLM-based reasoning through a contextual multi-armed bandit that transitions from exploration to semantically guided navigation as scene understanding improves. We evaluate the system in simulation and on a Unitree Go2 robot in large-scale real-world environments exceeding 1800 m2 with approx. 30k mapped semantic instances, achieving a 100% task success rate. On the GOAT-Bench "Val Unseen" benchmark, RoboAtlas achieves state-of-the-art performance with highest reported success rate (SR) of 90.6%, using GPT-4o, improving over the strongest prior baseline by 17.8 percentage points in SR. Using the much smaller Qwen2.5-VL-7B model, it still achieves 88.8% SR, outperforming all baselines using GPT-4o in SR, and revealing the importance of the information gained by our semantic mapping framework over simply replacing the underlying foundation model. The results demonstrate that grounding foundation models with large-scale 3D semantic maps enables robust and efficient contextual Active SLAM.
comment: Alexander Schperberg and Shivam K. Panda made equal contribution
In-Context World Modeling for Robotic Control
Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current observations and language instructions. By ignoring the underlying system configuration as a variable, these models implicitly assume a fixed execution context encountered during training, necessitating data-intensive fine-tuning for any new environment. In this work, we introduce In-Context World Modeling (ICWM), a framework that treats system identification as an in-context adaptation problem. ICWM enables robot policies to autonomously infer essential system variables from a short history of self-generated, task-agnostic interactions. Unlike traditional In-Context Learning that uses demonstrations to specify what task to perform, ICWM leverages the context window to understand how the system operates. By processing these interactions before task execution, the model implicitly captures the world dynamics of the current system, enabling adaptation to novel configurations without parameter updates. Extensive experiments in simulation and on real-world robot platforms demonstrate that ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.
G2DP: Diffusion Planning with Spatio-Temporal Grid Guidance
In autonomous driving, diffusion-based planners have emerged as a promising paradigm for robust motion planning in dense and interactive traffic, as they can effectively model diverse driving behaviors. However, their inherent stochasticity often requires explicit guidance during denoising to ensure safety and route adherence for robust closed-loop execution. Existing guidance typically relies on sparse, entity-centric geometric queries or post-hoc refinement, yielding limited situational awareness and fragile performance in interactive scenes. To address this issue, we propose G2DP (Grid-Guided Diffusion Planning), a diffusion-based planner that directly enforces dense environmental constraints through inference-time guidance. Specifically, G2DP constructs a differentiable spatio-temporal cost volume by fusing probabilistic future occupancy distributions with a route-progress map. By formulating this volume as a continuous safety energy functional, it injects dense gradients directly into the denoising loop, actively steering trajectory generation toward collision-free and progress-optimal regions. Extensive closed-loop evaluations show that G2DP achieves state-of-the-art performance on nuPlan, outperforming the strongest imitation-learning baseline by +7.2 points in reactive score. It further maintains top scores in zero-shot transfers to interPlan and DeepScenario benchmarks, with collision avoidance improving by +10.15 over the unguided approach on interPlan. These results demonstrate that spatio-temporal cost grids serve as an effective representation for robust guidance in diffusion-based planning.
FAR-LIO: Enabling High-Speed Autonomy through Fast, Accurate, and Robust LiDAR-Inertial Odometry IROS2026
Robust and accurate odometry estimation is essential in modern robotics. In environments characterized by highly dynamic motion and sensor noise, odometry estimation becomes increasingly challenging. Autonomous racing combines both factors in an unstructured setting, where minimizing odometry latency is essential for stable closed-loop control. This paper introduces FAR-LIO, a highly optimized CUDA-accelerated LiDAR-inertial odometry framework developed for Fast, Accurate, and Robust performance. Our system leverages a novel CUDA-based voxel hashmap to enable parallelized nearest-neighbor search and efficient map updates. We employ a sparsity-aware Generalized Iterative Closest Point algorithm with adaptive thresholding on top of the CUDA-based voxel hashmap with adaptive density to achieve low-latency without compromising accuracy. An Extended Kalman Filter serves as a robust backend. It utilizes an upsampling and delay compensation strategy to fuse the LiDAR odometry with high-frequency IMU data, thereby ensuring a robust and smooth odometry output. We evaluate FAR-LIO across four different sensor setups, using both public datasets and data from two autonomous racecars driving at speeds of up to 250 km/h. FAR-LIO achieves an average 6.9% reduction in the positional error and 38.4% lower runtime compared to state-of-the-art baselines on target hardware using a single parameter set. This demonstrates its computational efficiency and broad applicability. To build upon our work, our code is available open-source on https://github.com/TUMFTM/FAR-LIO.
comment: 8 pages, accepted for publication at IROS2026 (IEEE/RSJ International Conference on Intelligent Robots and Systems 2026)
Emcar: Embodied Controller for Animating Robots
This chapter describes EMCAR, a novel software tool for programming robot motion that leverages the unique affordances of artistic practices such as puppetry and drawing to conceive, design, and program novel interactions and realize new use cases for HRI. The advantage of this no-code platform is that it expands creative applications for collaborative robots - putting robots directly in the hands of artists - and provides an inclusive environment that enables individuals with little or no technical backgrounds to engage meaningfully in collaborations and robotics research.
comment: Published in Lecture Notes in Computer Science, DOI: 10.1007/978-3-032-15501-6_11
FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation
Vision-Language-Action (VLA) models are often constrained by the imitation ceiling imposed by sub-optimal data. While Reinforcement Learning (RL) fine-tuning can surpass this limit, it is notoriously sample inefficient. This challenge arises from two core issues: (1) catastrophic initial unlearning due to an unstable Q-function and (2) inefficient policy updates caused by low-quality exploration data, often forcing a reliance on costly human interventions. We introduce FORCE, a 3-stage framework that stabilizes fine-tuning by tackling both issues. FORCE first incorporates a Value-Calibrated Warm-Up phase, utilizing on-policy rollouts to mitigate the distributional shift of the Q-function. Subsequently, during the online stage, this calibrated Q-function acts as a filter for both the policy's own action proposals and expert data, ensuring only high-value actions are used for the policy update. We evaluate FORCE on various simulation and real-world tasks, and the result shows that FORCE achieves a 79% absolute improvement in success rates and outperform prior RL methods by 10%, while accelerating training by 32.5%. Critically, it mitigates the common success rate drop and achieves this robust performance without human intervention, marking a significant step towards deploying capable and autonomous robotic agents.
Action ControlNet: A Lightweight Delay-Aware Adapter for Smooth Asynchronous Control in Vision-Language-Action Models
Vision-language-action (VLA) models have shown strong potential for general-purpose robot manipulation, but their inference latency remains a major obstacle to stable high-frequency control. Asynchronous execution mitigates this bottleneck by overlapping policy inference with action execution, yet the next action chunk is still predicted from stale observations while the robot continues to move. Direct chunk stitching therefore introduces handoff discontinuities, action jitter, and failures in contact-rich manipulation. Existing remedies typically require either full-policy retraining or architecture-specific runtime logic. This work proposes Action ControlNet (ACNet), a lightweight delay-aware adapter that uses the executed motion suffix as a residual condition for a mostly frozen action head. ACNet leaves the pretrained backbone unchanged, introduces few trainable parameters, and remains compatible with generative action heads such as diffusion and flow matching. On Kinetix, Meta-World MT50, and a real-world SO-ARM101 platform, ACNet improves robustness under inference delay and yields smoother asynchronous trajectories than direct chunk stitching, while remaining more lightweight than full delay-conditioned retraining.
A Sensorised Lattice Footplate for a Semi-Active Prosthetic Foot
This paper investigates whether magnetic plantar sensing can be embedded directly inside the load-bearing compliant element of a low-cost semi-active prosthetic foot. We present a prototype integrating a sensorised 3D-printed lattice footplate, a servo-adjustable hydraulic damper, and a reduced-order ankle model. The damper is experimentally characterised to relate adjustment angle to damping coefficient. Controlled compression tests show tunable lattice stiffness, while cyclic normal loading shows that the embedded sensor tracks the testing-machine reference force, supporting plantar-force estimation without an external insole layer. Static-posture trials under approximately body-weight loading show that forefoot and rearfoot loading distributions are separable across four prescribed stance configurations, providing a preliminary check of the sensing pipeline. A feedforward damping schedule approximates the dorsiflexion trend of a reference ankle trajectory through early-to-mid stance, while exposing the expected limitation that a purely dissipative mechanism cannot generate active push-off. Together, these results demonstrate that sensing can be embedded inside the load-bearing compliant element of a prosthetic foot and used to drive semi-active damping.
comment: 6 pages, 7 figures, ICAC
Mixture-of-Experts RL for Fault-Tolerant Legged Locomotion
Legged robots deployed in planetary exploration and other remote environments must maintain reliable locomotion despite actuator failures and challenging terrain conditions. Although reinforcement learning has achieved strong results in legged locomotion, monolithic policies can struggle to efficiently represent the diverse control strategies required to compensate for different fault conditions. In this work, we propose a fault-aware modular control architecture that explicitly leverages fault-diagnosis information to activate specialized control experts associated with distinct actuator failure modes. Experimental results show that explicit fault-conditioned modular policies consistently outperform monolithic policies of comparable size, achieving higher locomotion performance across failure scenarios. Moreover, the proposed modular architecture retains competitive performance even under significantly reduced network capacity, highlighting its suitability for compute-constrained robotic platforms, such as those typically employed in space applications. The code associated with this work is available at: https://github.com/iit-DLSLab/fault-locomotion-isaaclab.
From Rubble Simulation to Active Magnetic Mapping: Quantum Sensing for Disaster Response
Locating survivors of building collapses within the first 72 hours is a critical challenge in disaster response, and existing sensing modalities provide only partial information about the structure beneath the rubble. This paper proposes drone-based quantum magnetometry as a complementary modality and develops a simulation pipeline spanning rubble physics, sensor-array deployment, and active spatial reconstruction. We use Unreal Engine to generate a steel-reinforced concrete parking-garage collapse and compute the induced magnetic field via a per-triangle dipole approximation, establishing that meaningful magnetic structure is recoverable in the sub-pT to sub-nT range from roughly 1 m above the roofline. Then, we feed sparse multi-sensor samples into a Gaussian Process Regression back-end driven by Bayesian active sampling and validate the pipeline across multiple independent collapse realizations; a three-sensor array optimizes the trade-off between gradient resolution and UAV payload constraints, and active sampling reaches peak structural correlation in roughly $100$ samples. Together, these results indicate that quantum-grade sensing could become a useful tool for drone-based structural analysis and potentially void detection in collapsed buildings.
comment: 9 pages, seven figures
DSP-SLAM++: A Unified Framework for Multi-Class, High-Fidelity Object SLAM in the Wild
Existing object-aware SLAM systems force a trade-off between real-time performance, multi-class support, and the generation of high-fidelity, semantically coherent object models. To address this trade-off, we present DSP-SLAM++, which extends the DSP-SLAM framework with an asynchronous mapping pipeline for real-time performance and dedicated sensor fusion adaptations for a monocular fisheye-LiDAR suite. Experiments demonstrate that our system generates fine-grained, geometrically-complete shapes for multiple object classes while eliminating severe mapping thread bottlenecks by reducing maximum object processing latency by up to 70\% compared to the state-of-the-art baseline, enabling robust, real-time performance on a challenging 25 Hz multi-class datasets. This work makes high-fidelity, multi-class object SLAM more practical for real-world applications like autonomous driving and robotic manipulation by enabling its use on platforms with common fisheye-LiDAR sensor setups. The open-source code is available at: [github.com/AUBVRL/DSP-SLAMpp].
comment: 9 pages, 9 figures
DeformGen: Dynamics-Based Topology Augmentation for Deformable Manipulation Policy Learning
Demonstration augmentation is proposed for cost-efficient data acquisition, but existing methods are fundamentally limited in deformable manipulation due to two challenges: (1) the state space is high-dimensional with physics-induced constraints, making valid configurations impossible to reach via low-dimensional pose perturbations; and (2) trajectory transfer is non-equivariant, as material points no longer move rigidly together under deformation. We present DeformGen, a dynamics-based augmentation framework that achieves topological diversity for deformable objects. For the state challenge, DeformGen expands the valid state distribution by applying localized physical disturbances and forward-simulating the dynamics to obtain topology-coherent, physically plausible deformable states. For the trajectory challenge, DeformGen transfers source manipulation trajectories via deformation-field warping, which lifts per-particle displacements into a continuous spatial function to adapt the end-effector trajectory consistently with the deformed geometry. In this way, our method jointly augments the state distribution and its associated manipulation behavior. Experiments on high-fidelity deformable manipulation benchmarks show that DeformGen generally improves policy learning compared with training on the original demonstrations alone and with rigid-style augmentation baselines.
A 3D-Printable Dataset for Fair Testing and Comparisons of Tactile Sensors
Existing texture datasets for tactile sensing primarily consist of sensor readings from a specific sensor interacting with available surfaces/objects rather than describing the textures themselves, limiting fair comparison between tactile sensors and hindering reproducible research. In this work, we introduce a 3D-printable dataset of mathematically defined textures designed to be fabricated reliably across different printers and filament types. The dataset consists of six parametrically generated surface patterns derived from combinations of sine-wave and Fourier-based functions, giving controlled variation in spatial frequency, amplitude, and directional structure. We evaluate the reproducibility of these textures across three popular 3D printers and multiple filament types by measuring variance in images captured using an optical TacTip sensor under controlled contact conditions. Our results show that print quality, particularly peak sharpness and stringing, affects tactile variance, with higher-end printers producing significantly more consistent signatures. Classification experiments using neural networks and PCA-based models further demonstrate that high-quality prints support strong within-printer generalisation, while cross-printer generalisation remains challenging due to geometric inconsistencies. This work establishes the first openly available, physically reproducible 3D-printed texture benchmark, providing a foundation for fair comparison of tactile sensors.
TacVerse: A Multi-Sensor Dataset and Benchmark for Cross-Sensor Vision-Based Tactile Perception
Vision-based tactile sensors (VBTSs) enable robots to infer contact geometry and force-related cues by imaging deformation through an internal camera, yet generalisation across sensor designs remains poorly understood. We present TacVerse, a multi-sensor dataset and benchmark for cross-sensor vision-based tactile perception. The dataset contains 106,800 tactile images from seven VBTSs and supports three downstream tasks: shape classification, grating classification, and force regression. Experiments are conducted under three settings: within-sensor training, zero-shot cross-sensor transfer, and few-shot adaptation. Strong within-sensor performance across all tasks indicates that the collected tactile observations are informative for the target objectives. Direct cross-sensor transfer, however, leads to substantial degradation. Shape classification is comparatively robust, whereas grating classification and force regression are more sensitive to sensor shift. Few-shot adaptation for force regression consistently improves performance on unseen target sensors but does not fully close the gap to within-sensor upper bounds. A representation study further shows that MAE (Masked Autoencoder) pretraining provides the most consistent gains across tasks and sensors. TacVerse provides a controlled testbed for studying sensor shift, data-efficient adaptation, and self-supervised learning in tactile perception.
Beyond a Shadow of a Doubt: Close Proximity Geometry Reconstruction Using FMCW Radar Shadow Effects
Reliable perception in adverse conditions remains challenging for autonomous systems, as cameras and LiDAR degrade in poor lighting or weather. Millimetre-wave FMCW radar is robust to such conditions, but its elevation collapse limits geometric reasoning. We observe that vehicle chassis occlude radar rays and form a distinctive geometric shadow, and its consistency can enable us to infer useful information about objects whose returns intersect this shadow. Motivated by this observation, we propose a method to recover the 3D, in-plane inclination of nearby slender vertical objects from this cue. The object inclination is retrieved without assumptions about the wider scene, but through an analytical, closed-form mapping between its radar return boundaries and the opening angle. Validation in simulation and experimentation on a Navtech CTS350-X radar shows that inclinations can be estimated under practical conditions, with segmentation of the object in the radar scan emerging as the main bottleneck. This work highlights chassis shadows as a novel geometric cue, extending the role of 2D rotating radar beyond localisation and toward 3D scene reconstruction.
ROAD-VLA: Robust Online Adaptation via Self-Distillation for Vision-Language-Action Models
Effective online adaptation of vision-language-action (VLA) models remains challenging, as sparse rewards provide weak supervision for high-dimensional autoregressive action policies. Although self-distillation can in principle provide denser training signals, we find that text-based privileged teachers conditioned on demonstrations, retrieved experiences, or high-level plans are ineffective for VLA adaptation, exposing a modality gap between symbolic guidance and low-level robot actions. We propose ROAD-VLA, an advantage-guided self-distillation framework that constructs a proximal teacher directly in action space by perturbing action-token logits with calibrated advantage estimates. This converts sparse rewards into dense token-level supervision while keeping the teacher close to the current policy. We further derive a policy-improvement lower bound under calibrated advantages and accurate teacher matching. Across seven robotic manipulation environments with in-distribution and out-of-distribution shifts, ROADVLA outperforms PPO in nearly all settings, demonstrating robust online VLA adaptation.
MIL-LC: A Robust Magnetometer-Inertial-LiDAR Fusion Multimodal Localization Framework
Localization in challenging environments, such as GNSS-denied, geometrically repetitive, or textureless scenes commonly found in offices, hotels, and underground parking facilities, remains an open problem for reliable autonomous mobile robot (AMR) deployment. Single-modality localization methods are inherently limited by the constraints of individual sensors. Although multimodal fusion frameworks have shown improved robustness, most existing approaches still rely heavily on geometric or texture features, or on infrastructure-based beacons, which increase installation and maintenance costs while reducing deployment flexibility. Recently, ambient magnetic field (AMF)-based localization has attracted growing attention because it does not depend on geometric or texture features, nor does it require additional infrastructure, making it a promising complementary modality for AMR localization. However, existing studies have only explored such fusion in pedestrian scenarios using smartphone-mounted sensor suites, and practical solutions for AMR systems remain largely unexplored. To address this gap, this article proposes a magnetometer-inertial-LiDAR fused multimodal localization framework with a custom-designed sensor suite, termed MIL-LC, which provides reliable localization even when LiDAR suffers from geometric degeneration or when the magnetic map changes during long-term deployment. Extensive experiments in both simulation and real-world environments demonstrate that the proposed MIL-LC framework achieves robust and accurate localization performance.
StairMaster: Learning to Conquer Risky Hollow Stairs for Agile Quadrupedal Robots
Climbing hollow stairs remains a challenging problem for quadruped robots due to the high risk of leg trapping, severe depth sparsity, and high-frequency depth-sensing noise. In this paper, we propose StairMaster, a novel three-stage reinforcement learning framework for stable locomotion on such extreme discontinuous terrains. Our architecture integrates a Cross-Attention mechanism to extract structural features from noisy depth data, alongside a Spatial-aware Recurrent Unit (SRU) that maintains robust spatio-temporal memory to mitigate perception blind spots. To bridge the sim-to-real gap in depth perception, we propose a high-fidelity sim-to-real depth sensor modeling pipeline that faithfully replicates real-world sensor artifacts. Additionally, we employ a 3D waypoint-guided active perception reward for proactive sensing, alongside hollow gap kinematic and stair edge penalties to ensure precise foothold placement. We successfully deployed StairMaster on a Unitree Go2 robot, demonstrating its ability to conquer hollow stairs with an unprecedented incline of up to 55$^\circ$ through zero-shot transfer. To the best of our knowledge, this is the first RL-based policy to achieve such steep hollow stair climbing in real-world environments. Project Website: https://sivan666666.github.io/StairMaster/.
comment: 9 pages, 9 figures
Stage-Aware and Roughness-Constrained Diffusion Policy for Multi-Stage Robotic Polishing
Polishing is a critical finishing process in high-end manufacturing fields such as aerospace, where surface quality directly affects the service performance and reliability of components. Robotic imitation learning provides a flexible solution for such tasks, but current methods remain limited in industrial polishing because of long-horizon dependencies, uncertain stage transitions, and the difficulty of modeling and regulating coupled process parameters. To address these issues, this paper proposes a Stage-Aware and Roughness-Constrained Diffusion Policy (SRDP) for robotic polishing. SRDP infers the process-stage posterior from multimodal observation histories and uses it to condition the shared reverse denoising process, enabling stage-consistent action generation without external stage labels during execution. Furthermore, a roughness-oriented process-constrained diffusion sampling method is incorporated to generate constrained feed speed and normal contact force under stage-wise preset spindle speeds, thereby improving process consistency and physical feasibility. Systematic experiments are conducted on two representative scenarios, namely spacecraft cabin coating-surface polishing and inner-cavity structural surface finishing. Comparisons with advanced baselines, ablation studies, and real-robot validations comprehensively evaluate the proposed method. The results show that SRD improves stage-transition stability, process-parameter consistency, and final surface quality across different polishing scenarios.
Learning Asynchronous Upper-body Task-space Trajectory Tracking Policy for Humanoid Robots
High-level humanoid planners often output sparse task-space, low-rate trajectories, whereas whole-body controllers run at high frequency. This creates temporal asynchrony between the planning and execution, and structural incompleteness for full-body control. We propose an asynchronous upper body task-space tracking framework for humanoids. A student policy is initialized by teacher-student distillation, conditioned on the full cached future trajectory and an execution-time index, and trained with a sliding-window global reward to reduce frame drift without explicit frame estimation. For task-specific post-training, an MPC module completes sparse references into floating-base and upper-body guidance, while action- and FK level self-guidance constrain policy drift. Simulation and Unitree G1 hardware experiments show improved tracking under low update rates, stronger performance than synchronous and decoupled baselines, and safer adaptation to out-of-distribution motions.
comment: 10 pages, 8 figures
Memory-Efficient Policy Libraries with Low-Rank Adaptation in Reinforcement Learning
When fine-tuning Large Language Models (LLMs), there has been success in minimizing both memory usage and computation with Parameter-Efficient Fine-Tuning (PEFT), like Low Rank Adaptation (LoRA). In this article, we have explored whether this approach is transferable to the world of robotics and Reinforcement Learning (RL), allowing learning with reduced memory usage and improved computational performance. Specifically, we focused on a version of multi-task robotics, where a library of specialist policies are created. In such a library memory efficiency is especially important. We used a Proximal Policy Optimization (PPO) algorithm and fine-tuned a baseline model to different tasks using LoRA. Our results demonstrate that, depending on the hyperparameters, LoRA can minimize memory usage by a factor of 20-160 compared to full fine-tuning of all layers. This implies a 90-95% storage saving when deploying a library of many (10-50) specialized policies, which can be the differentiating factor between being able to store the entire library in memory or having to use swap-memory in an applied robotics setting. At the same time, our results indicate that there is no significant difference in the success-rate between full fine-tuning and LoRA fine-tuning for the selected tasks.
SA-LIVO: Efficient LiDAR-Inertial-Visual Odometry with Subspace-Aware Degeneracy Handling
Tightly coupled LiDAR-visual-inertial odometry (LIVO) fuses precise geometric depth with complementary visual measurements, yet its exteroceptive sensors face independent failure modes: LiDAR degenerates when scan geometry is under-constrained, while visual measurements degrade under adverse illumination or texture absence. Existing countermeasures, including binary degeneracy detection, covariance inflation, and scene-level quality gating, operate at the modality level and leave the direction-dependent structure of the joint information matrix unaddressed. Consequently, visual residuals enter pose directions where LiDAR is well-constrained, while in deficient directions visual compensation disperses across the full state space rather than concentrating where needed. We propose SA-LIVO, a LiDAR-inertial-visual odometry system addressing these limitations through direction-selective fusion and information-efficient processing. The Subspace-Aware Information Fusion (SAIF) framework eigendecomposes the joint LiDAR-visual information matrix and applies a linear-clamp soft gate per eigendirection, attenuating degenerate directions while preserving observable ones at full strength. LiDAR and visual residuals are then jointly optimized in one InEKF loop at a shared linearization point. Since visual information contributes only where LiDAR is deficient, photometric Jacobians are assembled once before the loop and reused across iterations, avoiding the per-iteration cost of conventional iterated filters. Experiments on 29 sequences from three benchmarks (HILTI'22, New College, Oxford Spires) and concurrent-degradation scenarios show accuracy competitive with the strongest baselines and bounded drift where competing systems diverge. SA-LIVO averages 12.3 ms per frame on a laptop CPU and 26.8 ms on an embedded ARM board without GPU, with 3.6-6.3x lower peak memory. The code will be open-sourced.
comment: 20 pages, 12 figures, 5 tables
Power-Budgeted Underwater Vehicle Control via Constrained Reinforcement Learning
Underwater vehicles operate from a fixed onboard energy budget that propulsion rapidly depletes, so a controller that completes its task while drawing less thruster power directly extends mission range and endurance. Reinforcement learning yields capable model-free controllers for station-keeping and trajectory tracking, but optimizing task accuracy alone drives the policy toward oscillatory, energy-wasting actuation. The established remedy subtracts an energy penalty from the reward, yet this sets the task-power trade-off through a single weight with no physical units: a target power level cannot be specified, the weight must be re-tuned for every vehicle and task, and a mismatched weight can even raise power. This paper instead formulates energy-efficient underwater control as a constrained Markov decision process in which average thruster power is subject to an explicit budget, solved with a PPO-Lagrangian algorithm. The power level is set by declaring a budget in physical units, and a single dual variable is updated online to meet it for each vehicle and task, without manual weight search. Across three vehicles and four tasks in the MarineGym simulator, the energy-constrained policy draws the least power in all twelve settings, reducing it by 14--65\% (up to 64.9\%) over a task-only baseline and below an energy-reward baseline everywhere, while remaining the smoothest in ten settings and preserving task accuracy except in one deliberately power-limited regime. Imposing energy as an explicit constraint thus offers a tuning-free route to energy-efficient underwater control that needs no per-vehicle, per-task weight search.
comment: 10 pages, 10 figures
Learning to Adapt: Reptile-D-Learning for Robust and Efficient Control Under Parametric Uncertainty
Learning-based Lyapunov Control (LLC) provides formal stability guarantees for nonlinear systems, but its validity relies heavily on accurate system models. Parameter variations and uncertainties may invalidate stability constraints, leading to costly retraining. Although D-learning can estimate Lyapunov derivatives without relying on explicit dynamics models, it remains limited by single-task dynamics and degrades under large parameter shifts. We propose Reptile-D-learning, a framework that leverages the Reptile meta-learning algorithm to capture shared dynamical structures across systems with different parameters, thereby learning a generalizable Lyapunov network initialization and a high-performance controller. Experiments on multiple nonlinear control systems demonstrate that Reptile-D-learning significantly improves both generalization and rapid adaptation to unseen parameter configurations.
Calousel: Extrinsic Calibration of Non-overlapping Multi-camera Systems from Pure Rotation IROS 2026
Extrinsic calibration of multi-camera systems with non-overlapping FOVs has been a challenging problem in the robotics literature. Conventional target-based methods impose substantial target setup overhead, either deploying large calibration targets or requiring pre-measured multi-target poses. Motion-based approaches instead suffer from drift error, scale ambiguity, and motion degeneracy. Securing both accuracy and usability, we propose a novel calibration method that leverages pure rotational motion, requiring only a single static calibration board. The key idea is to make all cameras sequentially observe the same target under a shared geometric reference, even without overlapping views. To integrate these time-separated observations, we formulate the problem using a latent turntable frame and a 3D error on SE(3) within a global optimization framework. We validate the proposed method on both a controlled camera rig and a full-scale vehicle platform with heterogeneous cameras, and analyze robustness under non-ideal turntable motion. Extensive experiments show that our approach maintains competitive accuracy without specialized precision hardware, proving its strong suitability for realistic on-site deployments. Our code is publicly available here.
comment: Accepted to IROS 2026. 8 pages, 7 figures
Event-Adaptive Motion Planning with Distilled Vision-Language Model in Safety-Critical Situations IROS 2026
Robot navigation in safety-critical scenarios faces significant challenges from unforeseen semantic events, where collisions arise primarily from the unpredictable behaviors of dynamic agents rather than unseen objects. While large vision-language models (VLMs) offer remarkable capabilities in commonsense reasoning, frequently invoking them within the continuous control loop introduces severe computational latency, fundamentally destabilizing physical execution. To address these challenges, we propose event-adaptive motion planning (EAMP), an efficient framework for VLM-based robot navigation. Specifically, a prompt-configurable semantic event trigger (PC-SET) selectively activates semantic intervention by continuously monitoring short temporal clips for behavioral anomalies. Upon triggering, an event-triggered distilled SemNav-VLM, fine-tuned via physically verified semantic distillation, maps detected anomalies into discrete strategy-level decisions. Subsequently, a semantic model predictive control (SMPC) module translates these strategies into dynamic reconfigurations of optimization objectives and geometric references. Extensive experiments in safety-critical logistics scenarios demonstrate that EAMP effectively aligns high-level reasoning with low-level control, significantly improving dynamic safety margins over existing baselines while preserving real-time efficiency.
comment: 8 pages, 8 figures, 4 tables. Accepted by IROS 2026
Reasonable Motion: A General ASP Foundation for Environment Constrained Movement Trajectory Computation
We present a general answer set programming based hybrid quantitative-qualitative method for computing constrained branching trajectory modes for moving objects in real-world settings. The method performs constrained traversal of an environment graph, enumerating geometrically admissible motion behaviours as stable models, each constituting a distinct trajectory mode characterised by both domain-dependent and independent factors such as derived event sequence, map topology, and domain norms. The hybrid trajectory computation method is generally applicable across motion characteristics typically encountered in diverse dynamic domains with moving objects, e.g., autonomous driving. We demonstrate applicability and highlight how computed trajectories are traceable to their underlying stable model, thereby affording verifiable interpretability that purely learned approaches cannot provide. We also perform an empirical evaluation with Argoverse 2, a large-scale real-world autonomous driving benchmark representative of the class of dynamic domains within the scope of the proposed method.
comment: Accepted at: LPNMR 2026 - 18th International Conference on Logic Programming and Non-monotonic Reasoning, 7 - 11 September 2026 - Klagenfurt, Austria
1000 Rallies: An Event-Camera Dataset and Real-Time Learned Ball-State Estimation for Robotic Table Tennis
Robotic table tennis has emerged as a compelling benchmark for real-time robotic perception due to its fast ball dynamics and stringent timing requirements. Accurate, high-frequency, and low-latency ball state estimation is critical for reliable trajectory prediction and timely control. Traditional frame-based cameras face an inherent trade-off: low frame rates leave temporal blind spots that miss fast-moving objects and high frame rates raise data and computational cost. Event cameras instead offer microsecond temporal resolution and, under sufficient illumination, remain largely free of motion blur even at high ball speeds. However, the community lacks large-scale datasets to develop and benchmark event-based perception in realistic sports scenarios. We address this gap by introducing the first large-scale event-camera dataset for table tennis, comprising over 1000 rallies from a diverse group of players ranging from amateurs to elite-level athletes. Each recording captures the event stream alongside 14 synchronized high-speed frame-based cameras at 200 FPS, which we use to produce 1 kHz pseudo ground-truth labels for ball position, velocity, and spin. Building on this dataset, we train a convolutional neural network robust to background player motion that jointly estimates the ball's position and velocity in the image-plane from events. Treating the predicted velocity as an additional measurement in the Kalman filter reduces bounce-point prediction error by 36% relative to a position-only baseline. Finally, we close the perception-action loop by integrating the event-based system with a Stäubli robotic arm, enabling the first real-time human-robot table tennis rallies driven by event-based perception.
WOLF-VLA: Whole-Body Humanoid Optimal Locomotion Framework for Vision-Language-Action Learning
Vision-Language-Action (VLA) models have recently demonstrated strong generalization in robotic manipulation, yet their applicability to whole-body, contact-rich humanoid locomotion remains severely underexplored due to data scarcity, the absence of dynamically consistent demonstrations, and the difficulty of encoding optimality and safety in learning-based pipelines. This work introduces a unified framework WOLF-VLA that integrates whole-body optimal-control (OC) motion synthesis with large-scale multi-modal dataset to train VLAs capable of generating humanoid locomotion policies directly from natural-language instructions. We construct a comprehensive dataset of dynamically feasible humanoid trajectories across six locomotion-related task families, each parameterized by environmental variations, object colors, placements, and visual distractors. We train a VLA model using the collected joint trajectories, ego-centric visual observations and natural language instruction, yielding a policy that exhibits strong reasoning and robustness to initial-condition variability, and competitive performance across several tasks and environment settings. A systematic ablation study demonstrates the impact of each modality on the model performance. The full dataset, model checkpoints, and benchmarking simulation suite will be openly released, establishing a reproducible dynamically consistent benchmark for whole-body humanoid locomotion rich VLA control and enabling future research in scalable transfer of instruction-driven locomotion policies.
One Body, Two Minds: Variable Autonomy Approach for a Co-embodied Robotic Hand
Assistive robotic systems face a fundamental trade-off: fully autonomous systems lack user agency, while fully user-controlled systems demand continuous cognitive effort. Existing shared autonomy approaches blend human and robot commands but are mostly deployed in separate physical bodies. We introduce co-embodiment with variable autonomy, where human and robot share a single physical body and operate at different autonomy levels across task phases, from mutual autonomy during object search and grasping to human-dominant control during actuation. We present a co-embodied, wearable robotic hand that has its own ``mind'' and operates with variable autonomy levels. A learning-from-demonstration visuomotor diffusion policy enables autonomous grasping when the user positions the hand near known objects. Once grasped, the system signals completion and the human can actuate the grasped tool (drill, spray bottle, infrared thermometer, lighter, and ice-cream scoop) via hands-free head gestures. The human retains veto authority at all times through a release gesture that returns the system to the initial phase. Unlike blended autonomy, where control is continuously negotiated, our co-embodied approach consists of variable autonomy from full human control to full independent actions while maintaining physical coupling, realizing a one body, two minds paradigm. In a user study with 44 participants performing five bimanual tasks, users rapidly adapted to this ``two minds'' paradigm: completion times improved by 23.3% across trials ($p < 0.001$, Cohen's $d = 0.94$), the best-performing policy variant reached a 93.6% task success rate, and acceptance ratings were high (5.70/7 overall impression, 5.52/7 daily use willingness). This work establishes co-embodiment with variable autonomy as a viable approach for assistive robotics, enabling human-robot collaboration through co-embodiment.
ASSCG: Just-Right Gating over Chattering for Fast-Slow LLM Planning in Autonomous Driving
Large language models (LLMs) can improve autonomous driving planning but are costly to query online, and existing fast-slow planners often rely on hand-designed triggering rules that either over-call the slow system or call it at the wrong times. We formulate slow-system invocation as a resource-aware sequential decision problem and propose the Adaptive Slow-System Control Gate (ASSCG), which makes frame-level Query/Cache/Drop decisions to refresh, reuse, or suppress slow guidance. ASSCG uses an RWKV backbone for efficient long-horizon gating and is trained with supervised fine-tuning followed by GRPO-style compute-aware reinforcement fine-tuning. We apply ASSCG to two different fast-slow architectures: (i) AsyncDriver on nuPlan Hard20 closed-loop evaluation, where ASSCG improves score to 67.28 (+2.28) while reducing average end-to-end inference latency by 60%; and (ii) a RecogDrive-based dual system that we build by replacing its original VLM-2B module with a lightweight ViT-based fast planner and adding an LLM slow planner, evaluated on NAVSIM, where ASSCG achieves 91.4 PDMS (+0.6) and increases average speed by 25%. The project page, including video visualizations and additional results, is available at https://williamxuanyu.github.io/asscg/.
GROVE: Grounded Pedestrian Simulation via Natural Language for Interactive Social Robot Navigation IROS 26
Pedestrian simulation is a critical component for training and deploying social robot navigation approaches, yet it remains a largely rigid system that repeatedly requires manual data generation to define even simple scenarios. We propose GROVE, a text-to-scenario pedestrian simulation framework that combines state-of-the-art approaches to produce realistic, socially challenging scenarios for social robot navigation. Our framework allows users to customize one of several common presets (emergency, queuing, normal) or even enter a fully independent prompt to generate a highly customizable pedestrian simulation. Multiple modules separately ensure the realism and soundness of long-horizon human behavior, medium-horizon pedestrian navigation, and short-horizon robot/social interactions. Each module is tuned by the prompt in a way that reflects the user intent across all aspects of pedestrian simulation. By dynamically selecting one of several state-of-the-art (SotA) approaches in our modules based on the scenario, we capture many situational nuances of pedestrian behavior in order to narrow the simulation-to-real (sim2real) gap. The human simulation is directly integrated into Isaac Sim, Gazebo, and RViz simulators for robot deployment in highly social environments. We validate our approach through qualitative comparison against existing pedestrian simulation baselines across scenarios of varying complexity in residential, hospital, and office environments. The result is a high-fidelity pedestrian simulation that challenges social robot navigation with complex, diverse, realistic human behaviors.
comment: Accepted at IROS 26
AISPO: Enhancing Depth Reliability for Robotic Manipulation of Non-Lambertian Objects via Affine-Invariant Shape Prior
Reliable depth perception is critical for robotic manipulation, especially for non-Lambertian objects such as transparent or highly specular surfaces, where raw depth measurements are often corrupted or missing. These failures frequently propagate to motion planning, resulting in invalid grasp poses and execution errors. We propose AISPO, a depth completion framework that improves depth reliability for manipulation in challenging sensing conditions. AISPO combines multi-scale RGB-D feature fusion with an affine-invariant shape prior to enforce geometric consistency and mitigate catastrophic depth failures. Unlike methods that focus primarily on average depth accuracy, our approach emphasizes physical plausibility and structural integrity of the predicted depth maps. Extensive benchmark evaluations demonstrate competitive performance and strong generalization to unseen objects and novel scenes. Real-world grasping experiments further show that enhanced depth reliability significantly improves manipulation success rates, particularly for transparent objects where many existing methods fail to produce physically usable depth estimates.
comment: Published in IEEE Robotics and Automation Letters. 8 pages. Accepted April 2026
SAGE-Nav: Leveraging LLM Planning and Alignment Fusion for Hierarchical Scene Graph-Guided Navigation IROS 2026
Object-Goal Navigation (ObjNav) requires embodied agents to autonomously locate specified targets using only egocentric visual observations. Existing monolithic methods struggle with long-horizon reasoning and generalize poorly to novel environments. To address these limitations, we propose SAGE-Nav, a novel hierarchical framework that integrates the reasoning capabilities of Large Language Models (LLMs) with dynamic scene graphs. Crucially, it decouples asynchronous global semantic planning from the high-frequency reactive control loop. The LLM serves as a global planner, decomposing abstract instructions into a sequence of semantically grounded waypoints. To translate these plans into dense multi-modal guidance, we design a Hierarchical Scene Graph Encoder (HSGE) that leverages relational graph convolutions to produce structure-aware embeddings preserving both semantic and spatial topology. Furthermore, we develop the Goal-aware Alignment-Fusion Network (GAFN) to dynamically fuse real-time perception with these structural priors. Using an adaptive gating mechanism with an explicit inductive bias, GAFN ensures robust visual-topological alignment for the low-level policy. Extensive evaluations in the i-THOR and RoboTHOR environments demonstrate that SAGE-Nav achieves state-of-the-art performance, delivering substantial gains in navigation efficiency and zero-shot generalization while maintaining the low control latency required for physical robotic deployment.
comment: Accepted by IROS 2026
Generative AI for Safe and Photorealistic Drone Light Shows
Drone light shows are redefining aerial entertainment, yet their widespread adoption is bottlenecked by labor-intensive, manual animation. While generative AI promises an automated alternative, current frameworks fail to provide photorealism with fluid, dynamic motion. To address this limitation, we introduce SWAN, an end-to-end pipeline that synthesizes photorealistic, large-scale, and collision-free drone choreographies directly from text prompts. SWAN converts text into realistic reference videos and translates these pixel-space dynamics into physical swarm kinematics using a novel, adaptive point-tracking algorithm. Unlike existing trackers, this method maintains spatial coherence through severe occlusions and rapid topological shifts. A dedicated planner then allocates these trajectories to individual drones, while a subsequent safety filter ensures collision-free execution. We demonstrate scalability by safely orchestrating simulated 2,000-drone formations and validate physical feasibility on a dense real-world swarm of 49 quadcopters, operating everything entirely on standard consumer hardware. Combined, this work demonstrates how generative AI can be leveraged to automate multi-robot choreography design, providing an accessible new framework for drone light shows.
Delta-Position Estimation-Based IMU Odometry: A Comparison of MLP and Kolmogorov-Arnold Networks
In this study, the learning-based inertial odometry problem is investigated using raw IMU measurements obtained from the EuRoC MAV benchmark dataset. Instead of absolute position regression-a formulation that may lead to large constant errors-the models are trained to estimate the incremental displacement (Δp) over a fixed 50 ms sliding window, and the full trajectory is reconstructed through numerical integration. A standard Multi-Layer Perceptron (MLP) is compared with a Kolmogorov-Arnold Network (KAN) equipped with learnable B-spline activations. Although KAN has 6.9 times fewer parameters than MLP (8,444 versus 57,859), it produces a 44% lower error in terms of final cumulative drift on the test trajectory (9.61 m versus 17.23 m). In addition, KAN exhibits more stable behavior in terms of long-term error accumulation, with lower P_50 and P_90 cumulative drift values. These findings indicate that learnable B-spline-based activations have the potential to reduce error accumulation in the inertial odometry problem.
comment: This study was presented at the 11th International Congress on Engineering Sciences and Multidisciplinary Approaches, held in Istanbul, Türkiye
HEART: Coordination of Heterogeneous Expert Agents for Physically Grounded Robotic Task Planning
Large Language Models (LLMs) can reason over complex instructions but often fail to satisfy the physical and spatial constraints required for robotic task planning. Recent LLM-based planners directly translate text into action sequences, yet they lack structured reasoning about feasibility, reachability, and logical order, resulting in invalid or incomplete plans. We present a heterogeneous multi-LLM framework that decomposes instructions into atomic reasoning tasks and allocates them to role-specialized expert agents under a token budget for real-world computational and communicational constraints. By combining role-oriented reasoning from heterogeneous agents followed by constraint-driven plan synthesis, HEART validates capability, reachability, and constraint conditions before planning and helps produce physically executable plans while maintaining efficiency. Experiments across different household benchmarks show that HEART consistently improves plan success compared to single-LLM and rule-based planners, demonstrating that heterogeneous LLM collaboration enables robust and scalable robotic task planning under resource constraints.
comment: 9 pages, 3 figures
MAPL: Multi-Objective Preference Learning for Robot Locomotion
Reward design remains a major bottleneck in reinforcement learning for robot locomotion, where successful policies often depend on carefully tuned, task-specific reward functions. Preference-based reinforcement learning offers an alternative, but existing LLM-based methods typically ask for a single overall judgment between behaviors, making it difficult to capture the multiple competing objectives that underlie high-quality locomotion. We present Multi-Objective AI-Informed Preference Learning (MAPL), a framework that learns locomotion rewards from high-level natural language objectives rather than manually engineered reward equations. MAPL prompts a large language model to compare trajectories independently along semantically meaningful criteria, using generic language descriptions that are terrain-invariant and require little domain expertise. These objective-wise preferences are used to train a multi-head preference scoring model, whose outputs are aggregated to form a scalar reward for policy optimization. Across four quadruped locomotion environments, MAPL trains policies using only LLM-generated preferences and achieves performance comparable to or better than expert-designed rewards, while eliminating task-specific reward engineering.
Large-Scale Tunnel Air--Ground Collaboration With FLISP: Fast LiDAR-IMU Synchronized Path Planne
Hydropower tunnel inspection is critical for infrastructure integrity yet remains inefficient and hazardous using manual methods. We propose FLISP (Fast LiDAR-IMU Synchronized Path Planner), a mapless planning framework for cooperative UGV-UAV inspection. Unlike traditional map-based paradigms, FLISP features three core contributions: (1) a unified architecture where a single UGV-mounted LiDAR-IMU suite drives synchronized path generation for both platforms; (2) platform-specific solvers utilizing an enhanced Firefly Algorithm for UGV obstacle avoidance and a dynamic iterative optimizer for UAV flight; and (3) a hierarchical refinement strategy ensuring kinematic feasibility without state estimation drift. Benchmarks in a 1.2 km operational tunnel demonstrate that FLISP circumvents structural bottlenecks of map-based methods, eliminating map rasterization overhead (Fast-LIO2 + A*) and sampling instability (LIO-SAM + RRT*). FLISP achieves a 100% success rate with ~7 ms latency, representing a 7-fold speedup over grid-based and three-order-of-magnitude improvement over sampling-based baselines. Validated in operational hydropower tunnels, this approach offers a scalable solution for robotic inspection in feature-degraded linear infrastructure. A demonstration video is available at https://youtu.be/Y_ezs1PfLJ4, and the code at https://github.com/ArchibaldGuo/FLISP.git.
comment: 24 pages, 31 figures, 5 tables. Author accepted manuscript. This work was supported by the State Key Laboratory of Autonomous Intelligent Unmanned Systems. The authors also thank the KinaMind Society for its inspiring environment and support
Commerge: Communication-Efficient, Robust, and Fast LiDAR Map Merging Framework for Multi-Robot Coordination in Resource-Constrained Scenarios
By maintaining global consistency across robot teams, multi-robot LiDAR map merging enables faster exploration and efficient area coverage. However, map merging requires exchanging massive sensor data between the server and robots, making communication the bottleneck, especially in communication-constrained environments. Therefore, we present Commerge, a communication-efficient map merging framework that achieves bandwidth reduction through graph-theoretic selective data exchange. By doing so, our Commerge reduces inter-robot communication by up to 5,000x while maintaining alignment accuracy. Our key insight is that only a small subset of carefully selected scans is sufficient for robust map merging. We formulate this as a three-stage cascaded optimization problem on an exchange graph, where vertices represent robot keyframes and edges denote candidate inter-robot loops. Through three cascade stages, we select a sequentially overlapped, balanced-transmission-cost, and geometrically-perceptually optimal scan subset that preserves alignment quality while reducing communication. Unlike existing approaches that either transmit whole scans, which require GB-scale data exchange, or employ naive downsampling, our approach exchanges only MB-scale data while achieving comparable alignment accuracy. Extensive evaluation on five public datasets and four in-house datasets covering cave, planetary-analog, indoor, and outdoor campus environments shows up to 99.98% reduction in data exchange (e.g., from 7,000MB to 1.3MB on the HeLiPR dataset), while maintaining alignment performance across embedded to desktop platforms. The supplementary materials are available at https://sparolab.github.io/research/commerge.
comment: 32 pages, 32 figures
Reliability-Asymmetric Spacecraft Autonomy: Co-Designing a Capable Learned GNC Stack with a Verified, Adaptation-Aware Runtime Shield
Deep-space missions need onboard autonomy that is both capable and certifiable. Rule-based autonomy is certifiable but brittle, while learned autonomy is capable but hard to verify. We present AMPLE-GNC, a three-tier guidance, navigation, and control stack. Its capability path combines a small foundation-model commander that maps natural language to PDDL+, a constraint-screening verifier, and a fault-adaptive controller. All three are bounded by a runtime shield with nine linear-temporal-logic invariants whose predictor soundness is machine-checked by the Kind 2 model checker. On a 6-DOF Basilisk testbed, we make three contributions. First, we deploy an edge commander. Fine-tuning a pretrained 360M model with grammar-constrained decoding gives a hard output-validity guarantee and 84% planner-executable actions. On a de-leaked test, novel-phrasing generalization is 38% exact and 51% action, rising to 48% exact after phrasing-diversity re-finetuning; we separate syntactic validity from semantic accuracy. Second, we introduce a fault-adaptive controller. Rapid Motor Adaptation infers latent actuator faults online and recovers 97.8% of actuator-sign faults and 94.4% of continuous-gain faults within the training randomization envelope. Fault-unaware PD and from-scratch end-to-end RL both score 0%, while the strongest classical-adaptive baseline reaches 55% on continuous gain. Beyond the envelope, a split-conformant retrain scores 57-67%, and adding 4x more in-regime data worsens performance, showing that randomization breadth, not data volume, drives generalization. Robustness is flat under star-tracker noise to 0.005. Third, we show that a latching safe-hold shield can suppress even a capable controller. A split-conformal recovery-deadline certificate with adaptation-aware engagement reconciles safety and recovery, keeping the controller 94.5% autonomous while still catching non-recovery.
Decoupling Semantics and Geometric Grounding: Spatial Visual Prompts for Language-Conditioned Imitation Learning
While end-to-end Vision-Language-Action (VLA) models show promise in robotic manipulation, their monolithic paradigm inherently couples semantic reasoning and spatial control. This creates a severe alignment bottleneck, limiting precise target disambiguation in data-constrained imitation learning. To overcome this, we propose SVP-IL, a decoupled architecture that explicitly extracts spatial visual grounding from the action generation loop. By leveraging vision-language foundation models, we parse instructions into zero-shot geometric masks, translating language into explicit Spatial Visual Prompts (SVP). These priors are injected into a continuous action generator via a lightweight direct feature-level fusion mechanism. This integration provides explicit and uncorrupted spatial gradient guidance while ensuring highly stable optimization under low-data regimes. Extensive experiments demonstrate that SVP-IL significantly outperforms state-of-the-art VLAs and pure visuomotor baselines. Trained on as few as 50 to 100 demonstrations, SVP-IL improves average success rates on highly ambiguous language-conditioned tasks from 24.0% to 39.5%, achieving 67.8% on standard benchmarks. Real-world robotic experiments further validate its robustness and data efficiency in unstructured physical environments.
Self Capacitive Tactile Sensor System designed for Companion Robots
Tactile sensing is essential for humanoid robots to achieve safe physical interaction, dexterous manipulation, and truly human-like responsiveness. However, the design of such systems remains challenging. Conventional approaches often suffer from complex multilayer structures, intricate wiring, high cost, and poor scalability, making it difficult to realize full-body tactile sensing with real-time, low-latency detection while maintaining minimal computational load on the robot's main processor. In this work, we present a simple, scalable and hardware friendly tactile sensing system for a companion humanoid robot based on the self-capacitance principle. The proposed sensor system employs a single conductive fabric layer with a conductive fabric wire architecture and does not require intricate electrode patterning. Scalability was demonstrated by fabricating a 100-point sensor array on a flexible printed circuit (FPC). Evaluation across sampling frequencies showed that 10 Hz is insufficient and misses transient events, whereas 100 Hz and 1000 Hz reliably capture and clearly distinguish all interaction types: gentle touch, slow tapping, fast tapping, and hitting. A decision-tree classifier was implemented directly on the FPGA, offloading real-time inference from the Raspberry Pi 4 with minimal latency and negligible power overhead. This design fully meets the tactile sensing requirements of the HIRO-chan robot and is well-suited for full-body tactile sensing in HIRO-chan and other companion robots.
AI Coaching for Accelerating Human Skill Development with Reinforcement Learning
AI copilots can substantially boost human performance through shared control, but excessive assistance can induce over-reliance and skill atrophy. This paper studies how an embodied AI agent can act as a coach that accelerates human motor-skill development. We argue that effective coaching requires strategic scaffolding and stepping back that are aligned with the learner's capability, allowing productive failures that drive learning. We formalize the interactive AI coaching process as a non-cooperative dynamic game in which the learner optimizes task performance while the coach targets the learner's independent competence. Building on this formalism, we develop a reinforcement learning framework combining adaptive shared control with probabilistic models of the coach's causal influence on skill evolution, enabling tractable training of coaching policies. A comprehensive user study (N=33) on first-person-view drone racing shows significant gains in human learning outcomes over state-of-the-art AI coaching baselines.
WaveForward: An Omnidirectional Passive Wheeled Quadruped Robot with Casters
Wheeled-legged robots possess both agile mobility for traversing complex terrains and high efficiency, making them suitable for long-distance transportation applications. Conventional actuated wheeled robots require specialized hardware and electrical design due to the incorporation of wheel components. We propose a novel and low-cost passive wheeled legged robot equipped with standard casters on each leg to obtain omnidirectional mobility. The control method employs an asymmetric actor-critic structure, enabling the utilization of the privileged information of the passive caster's angles and velocities. We develop a caster base posture adjustment strategy based on velocity commands, utilizing actuated joints to modify the caster base joint axis posture and thereby adjust the propulsion direction of the casters. Moreover, we implemented multiple propulsion modes to achieve varying degrees of caster twisting oscillation, converting these into propulsive force. We conducted a slalom test and mode switch experience, which shows the passive wheeled quadruped could achieve omnidirectional movement versatility, and reduce the cost of transport (COT) by up to 89.1% with respect to legged motion.
comment: 8 pages,11 figures
DynaMOMA: Instantaneous Prediction of Grasp Poses for Mobile Manipulation of Dynamic Objects
Mobile manipulation is a fundamental robotics task and has advanced rapidly in recent years, enabling robots to navigate, reach, and interact with objects in complex environments. However, mobile manipulation of dynamic objects remains highly challenging, as robots must coordinate the mobile base and arm while adapting to continuously evolving target poses. A key challenge lies in predicting temporally consistent short-horizon grasp trajectories from dynamic observations. In this work, we propose \ours{}, a dynamic mobile manipulation framework that couples instantaneous grasp trajectory prediction with whole-body control policy. Our predictor uses an anchor-based diffusion model to generate temporally consistent short-horizon grasp trajectories conditioned on historical observations. The predicted trajectories are then encoded as compact features and fed to a whole-body reinforcement learning policy, which controls the mobile manipulator for dynamic grasping. We further introduce a anticipation-guided reward that equips the policy with an anticipatory grasping horizon by adaptively shifting the target from the current grasp observation to the instantaneously predicted grasp trajectory. Through extensive experiments in Isaac Gym simulation, we show that our method achieves strong performance in mobile manipulation of dynamic objects across diverse settings and grasping metrics. Furthermore, our predictor and policy demonstrate strong generalizability in real-world experiments.
An Integrated Hardware-Software Design for Low-Data Spatial Defect Detection in Robotic Visual Inspection with Hybrid Optoelectronic Neural Networks
To address data overload and inefficient shape-level annotation in robotic visual inspection, this paper proposes a hardware-software integrated optoelectronic architecture. A non-imaging, low-data paradigm is established to minimize annotation dependency. First, a sensor-in-the-loop strategy reconfigures a Digital Micromirror Device (DMD) as a physical optical convolutional layer, enabling photonic-domain feature extraction that unifies sensing hardware and processing software. To suppress data volume at the source, a block-based compressed sensing strategy encodes spatial information into low-dimensional temporal signals, drastically reducing redundancy. Subsequently, to bypass laborious manual defect shape annotation, natural language descriptions guide the network to align with highly generalizable features from Contrastive Language-Image Pre-training (CLIP), steering the attention maps of the optoelectronic neural network toward defect shapes. Furthermore, a Localization Accuracy for Attention (LAA) metric is proposed to quantify shape-level defect localization performance. Experiments on transparent material defect detection validate the system's effectiveness. Parametric analysis reveals how measurement matrices, compression ratios, and block sizes affect accuracy. Results show that, compared to traditional imaging, the proposed architecture maintains equivalent accuracy while reducing data volume by 90% for Vision Transformers and computational workload by 60% for Convolutional Neural Networks. This low-data paradigm offers an efficient solution for industrial automation scenarios involving massive data streams, high acquisition costs, or constrained edge resources.
WatchAct: A Benchmark for Behavior-Grounded Robot Manipulation
A robot working alongside people must reason about what they have done, in what order, and with what intent. Video carries the spatial layouts, object histories, and gestures that language leaves underspecified, yet today's manipulation benchmarks pair an instruction with a single current image, offering no way to evaluate reasoning over observed human behavior. We introduce WatchAct, a benchmark for robot manipulation grounded in observed human behavior. Each instance pairs a real-world human-action video and a language instruction with an aligned simulator scene and an executable LIBERO task, enabling scalable and reproducible evaluation. WatchAct comprises 3,000 long-horizon instances across 14 tasks in four capability domains drawn from the cognitive demands of watching another agent: parsing events (Event Grounding), recovering procedural structure (Procedural Reasoning), inferring unstated intent (Implicit Intent Inference), and tracking how the scene was changed (Episodic Reasoning). We further propose a disentangled evaluation protocol that separately measures (i)~video-to-plan reasoning by vision-language models, (ii)~policy execution under oracle plans, and (iii)~full task completion by integrated planner--policy pipelines. In both simulation and on a Franka Research 3 robot, current systems remain far from solving WatchAct. The best pipeline, Gemini-3.1-Pro with $π_{0.5}$, reaches only 16.3% Success Rate (SR) in simulation and 14.0% on the real robot. Gemini-3.1-Pro attains just 36.8% Plan SR (vs. 97.1% for humans), while $π_{0.5}$ reaches only 21.5% Task SR under oracle plans and drops to 10.6% on out-of-domain scenarios. Dataset and code are available at https://baiqi-li.github.io/watchact_page/.
Play2Perfect: What Matters in Dexterous Play Pretraining for Precise Assembly?
Multi-fingered robots promise the speed and dexterity of human hands, yet challenging problems such as precise assembly have remained out of reach. These tasks are contact-rich, making data collection for imitation learning difficult, and sparse-reward, making direct exploration with reinforcement learning (RL) intractable. Consequently, prior work has made progress by structuring the problem with specialized grippers, tool attachments, and environment fixtures. In this work, we argue that before a robot can perfect precise assembly, it must first learn to play. We further ask the question: what factors in the process of learning to play matter for precise assembly? We propose Play2Perfect, an RL framework for task-agnostic pretraining through play on diverse objects and goals, which is then perfected on precise assembly. The goal of play is to acquire reusable manipulation priors, such as grasping, in-hand reorientation and pose reaching. Finetuning then adapts this general prior to assembly, focusing exploration on the final contact-rich, high-precision interactions needed for success. We systematically study key design choices in play pretraining, including object diversity, training objective, trajectory diversity, and goal precision. We show that our prior is 33x more sample-efficient than RL training from scratch, even when provided with dense, multi-stage rewards. We demonstrate zero-shot sim-to-real transfer, achieving 60% success on tight insertions with only 0.5 mm contact clearance, and over 50% success on long-horizon multi-part assembly and screwing.
comment: 22 pages, 12 figures, 4 tables. Project page: https://play2perfect.github.io
A System for Fast, Resilient, and Adaptable Loco-Manipulation Behaviors on Humanoid Robots
Humanoid robots could take on physically demanding, hazardous, and repetitive work in spaces built for humans. However, a useful robot for these spaces must coordinate locomotion, whole body motion, perception, contact, and operator supervision. This thesis presents a robot-local, runtime-editable behavior authoring and runtime system. Our system strives to be maximally observable, predictable, and directable following Coactive Design principles developed during the DARPA Robotics Challenge. Our operator interface remains continuously synchronized to the robot for runtime authoring, monitoring, and repair. Our behavior architecture uniquely combines object-centric Affordance Templates, organization and logic inspired by Behavior Trees, and runtime-editable perception through a behavior scene and primitive scene actions. Action primitives build on a whole-body controller that supports moving the arms while walking, and use a concurrent action layering algorithm for speed. The behavior library developed during this work covers more than twenty real-robot task variants, including push and pull doors with knob, push-bar, and lever-handle mechanisms, multi-step exploration sequences, obstacle clearing, and reactive table-to-table manipulation tasks. This behavior system has been deployed on many humanoid robots, such as Boston Dynamics' DRC Atlas, NASA's Valkyrie, IHMC and Boardwalk Robotics' Nadia, Unitree's H1-2, and IHMC's Alex. We evaluate our system across capability, speed, reliability, and speed of behavior creation, adaptation, extension, and combination. Our experiments demonstrate that we can adapt, extend, and combine existing behaviors to create novel loco-manipulation behaviors in minutes or hours. Videos: https://www.youtube.com/playlist?list=PLJK5CTyotYqsfgfnXb-09YNFeBose6uEY.
comment: PhD dissertation, University of West Florida, Department of Intelligent Systems and Robotics, June 2026. 331 pages
Rethinking Training & Inference for Forecasting: Linking Winner-Take-All back to GMMs ECCV 2026
Trajectory forecasting for autonomous driving has advanced rapidly, yet representative models often produce uninformative posteriors over forecast modes, causing problems for mode pruning. We trace this to a modeling-training mismatch: forecasters are typically modeled as conditional Gaussian mixture models (GMMs) but trained with a winner-take-all (WTA) loss that assigns each sample to its nearest mode. We argue that this K-means-like hard assignment (one-hot), while preventing mode collapse, is the source of uninformative mode probabilities: it over-segments the trajectory space, ignores relatedness among nearby modes, and yields assignment instability under small perturbations. Guided by this lens, we introduce two post-hoc treatments: (1) test-time posterior-weighted merging that aggregates nearby candidate trajectories; and (2) a one-step expectation-maximization (EM) update that replaces hard labels with soft responsibilities, sharing probability mass across neighboring modes. Across several WTA-trained architectures, these lightweight steps produce more informative, faithfully ranked mode posteriors and strengthen final forecasts on popular displacement metrics -- without retraining. Our analysis unifies recent design choices through a GMM-vs-K-means perspective and offers principled, practical corrections that better align training objectives with inference.
comment: Accepted by ECCV 2026
CoStream: Composing Simple Behaviors for Generalizable Complex Manipulation
Long-horizon, contact-rich complex manipulation tasks, such as seating a GPU into a PCIe slot, demand both millimeter high precision and out-of-the-box generalization to new tasks. Existing paradigms struggle to satisfy both: classical pipelines use brittle, task-specific interfaces to achieve high-precision control but require costly pipeline redesigns to adapt to new tasks, whereas monolithic end-to-end policies provide better generalization but lack high precision on complex, out-of-distribution tasks unless retrained with new data. Both paradigms share an implicit assumption: once a manipulation capability is acquired, it must be deployed as a rigid pipeline or monolithic whole, rather than being freely decomposed and recomposed. In this paper, we show that complex manipulation capabilities can emerge naturally from the composition of simple, independent behaviors. Rather than deploying a monolithic policy or a rigid pipeline, we propose \ourshort, a framework orchestrating foundation models and diverse sensing modalities into multiple composable core behaviors: a semantic behavior extracting spatial constraints via foundation models; a predictive behavior forecasting trajectories by tracking keypoints in imagined videos; and a reactive behavior providing high-frequency tactile and force corrections. On a shared $SE(3)$ interface, these outputs compose by right-multiplication into a single pose command at each control step, executed by a compliant controller. We demonstrate \ourshort on 8 real-world tasks spanning everyday manipulation and precision assembly, with the strongest gains in contact-rich assembly and object transfer, and show robust recovery from manual perturbations during execution. {Website:} https://costream-simple.github.io
comment: Website: https://costream-simple.github.io
PRISM: Efficient and Locally Optimal Probabilistic Planning with Reachability Guarantees
Belief-space planning under motion uncertainty and state and control constraints remains a fundamental challenge, largely due to the difficulty of establishing reachability guarantees in constrained belief spaces. Existing constrained belief-space planners rely on sampling to construct multi-query belief roadmaps and explicitly find feasible trajectories between sampled nodes to establish reachability. These methods often struggle to cover the belief space or use robust control techniques that improve coverage at the cost of indirect, high-cost trajectories; they also lack finite-time or finite-memory completeness guarantees. We propose PRISM, a multi-query motion planning algorithm for belief spaces with state and control constraints that targets both high coverage and low cost. We present a new result on controllability of the state covariance under constraints, which is used by PRISM to decompose belief-space planning into deterministic mean planning and covariance shrinking. PRISM further includes an online local optimization method that reduces the cost of feasible belief-space trajectories. Under mild assumptions on the start and goal distributions, we prove that PRISM guarantees full coverage (i.e. completeness) despite actuator and obstacle constraints. In challenging simulated scenarios, PRISM achieves substantially higher roadmap coverage than state-of-the-art belief-space planning methods while producing trajectories with lower mean cost and cost variance. For example, PRISM achieves 100% coverage in easy and medium-difficulty scenarios, and, in the hardest scenario, which violates PRISM's coverage assumptions, it still achieves 97-100% coverage, while all other methods achieve less than 45%.
Exploring the Intrinsic Geometry of Diffusion Models with Constrained Inverse Kinematics
Recent studies suggest that diffusion models can recover geometric structure in the data manifolds they are trained on, yet the supporting evidence has so far come mostly from natural-image data, where the underlying geometry itself is unknown. We study this question in a setting where the geometry is analytically tractable: constrained inverse kinematics (IK). Each task-space constraint defines a configuration-space manifold with known intrinsic dimension, giving direct ground truth for evaluating the geometry learned by the model. For each of the 6-DoF UR5 and 7-DoF Franka, we train a single conditional diffusion model across seven constraint families, spanning solution manifolds from discrete IK branches to self-motion manifolds. Our empirical results reveal that the intrinsic dimension recovered from the model's score function matches the analytical degrees of freedom of the corresponding constraint manifold across both robots. Moreover, linear interpolation in the latent space leads to generated solutions that remain close to the appropriate constraint manifold, indicating that the learned representation further captures geometric structure of the constraint family beyond intrinsic dimension alone. Constrained IK therefore offers a controlled setting for studying the intrinsic geometry learned by diffusion models.
comment: 21 pages, 12 figures, 10 tables
MPC-Injection: Biasing Off-Policy Locomotion RL Toward Controller-Induced Behavior Basins
Reinforcement learning (RL) for locomotion frequently converges to locally optimal but undeployable behaviors, such as vibrating limbs or scooting on the torso, that maximize return without producing a usable gait. We present MPC-Injection, a low-overhead method that steers RL toward a designer-preferred gait by inserting transitions into the replay buffer from a model predictive controller solving the same Markov decision process. Unlike reward shaping, MPC-Injection does not require redesigning the task reward, and unlike adversarial imitation learning, it adds no discriminator, no kinematic retargeting, and no auxiliary objective. Instead, the controller's preferred behavior is transferred to the policy purely through the replay state distribution. On a 2D walker in simulation and with sim-to-real evaluation on a Go2 quadruped, we show that MPC-Injection drives the policy into the controller's behavior basin using a one to two-term task reward, producing gaits qualitatively comparable to those of reward shaping with twenty-one tuned terms and of adversarial motion priors without their discriminator and retargeting overhead. We further analyze how the injected transitions bias actor-critic updates toward controller-visited states, allowing the policy to learn behaviors that pure RL may fail to reach under simple reward functions.
comment: 22 pages, 10 figures
Charting the Growth of Social-Physical HRI (spHRI): A Systematic Review Pipeline Augmented by Small Language Models
Social-physical human-robot interaction (spHRI) has grown rapidly across robotics, human-computer interaction, human-robot interaction, and haptics. Yet, fragmented terminology and inconsistent methodologies make systematic synthesis difficult. To support scalable review practices, we evaluated the extent to which small language models (SLMs; < 1.5B parameters) can assist with title and abstract screening for a large spHRI systematic review. While no SLMs matched human reviewers' performance, the models operated locally and screened papers orders of magnitude faster. The combined SLM ensemble identified 39 papers reviewers missed, representing 10.29% of the final relevant dataset. These results demonstrate that SLMs can augment, rather than replace, expert reviewers and make large-scale literature reviews accessible and sustainable.
comment: 5 pages, 3 figures, 2 tables, Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction
Scaling Nonlinear Optimization: Many Problems One GPU
Many robotics problems, including trajectory optimization, inverse kinematics, and contact-rich motion planning, reduce to nonlinear programs (NLPs). Mature NLP solvers such as IPOPT can solve these problems, offering hard constraint satisfaction, optimality guarantees, and favorable scaling with problem dimension. These solvers underpin gradient-based methods in robotics, yet remain CPU-bound and solve only one problem at a time, preventing their integration into GPU-batched learning pipelines. On the other hand, sampling-based approaches such as reinforcement learning, model predictive path integral, and imitation learning have become the core of modern robotics research due to their ability to leverage GPU-batched simulators. These simulators can generate orders of magnitude more dynamics rollouts per second than was previously possible. If a GPU-batched NLP solver existed, it would unlock similar speedups in the number of constrained, locally optimal solutions generated per second. This regime of solving many problems concurrently versus solving a single problem at a time is a key requirement for integrating NLP solvers in modern GPU-batched robotics frameworks. To this end, we introduce \texttt{jaxipm}, the first GPU-batched NLP solver, based on IPOPT, and implemented in JAX. We accomplish this by redesigning IPOPT's algorithm to eliminate control flow with \textit{heterogeneous iteration fusion}, and by minimizing GPU idle time with \textit{iteration level batching}. We evaluate \texttt{jaxipm} on a variety of quadrotor nonlinear model predictive control benchmarks, including reference tracking in the presence of obstacles, multi-quadrotor navigation without collision, and navigation in a cluttered environment. We demonstrate up to a $32.85\times$ increase in throughput over IPOPT. Our complete open-source codebase is available at https://github.com/johnviljoen/jaxipm.
comment: 8 pages, 6 figures, ieeeconf style, submission for RA-L
KRVF: A Source-Aware Semantic Voxel World Representation for Edge Mobile Manipulation
Mobile manipulators need world models that are current, queryable, semantically meaningful, and usable under edge-compute constraints. This technical report presents KRVF, a source-aware semantic voxel world representation for edge mobile manipulation. Unlike reconstruction-centric mapping pipelines that primarily optimize global geometric fidelity, KRVF represents local world state as task-oriented voxels that encode occupancy, color, semantic evidence, temporal freshness, and evidence source. The representation separates measured occupancy from semantic-prior hypotheses, enabling depth-failure-aware object reasoning without silently corrupting persistent geometry. KRVF also closes a feedback loop between mapping and sensing by rendering map-prior depth for repair, and exposes task-level query operators for semantic objects and grasp candidates. The report formalizes the KRVF representation and documents a ROS 2 implementation that turns online RGB-D observations into a task-facing robot memory.
comment: Technical report, 9 pages, 3 figures
Layered Outer-Loop Control for Disturbance-Robust Multi-Waypoint UAV Arrival
Disturbance-robust UAV position control is easy to demonstrate in benign simulations but much harder to make fast in approach, well behaved near the target, and credible beyond a single benchmark. This letter presents a layered terminal-control architecture for multi-waypoint UAV position regulation together with a staged evaluation across PyBullet, PX4/Gazebo, and hardware. Phase I uses a PyBullet benchmark with stochastic wind for rapid structural selection, identifying a controller core that separates smooth approach generation, persistent-bias compensation, and supervised near-target terminal regulation. Phase II carries only that main architecture into a more demanding PX4/Gazebo closed loop, where the outer-loop controller acts through a cascaded flight stack with delay-sensitive settling and stronger transit-to-hover coupling. This step exposes which benchmark gains survive autopilot-mediated dynamics and which refinements collapse once the loop becomes more deployment-like. In Phase I, the bare controller attains 0.024 m mean late-stage wind error. In Phase II, the final controller is selected using a transfer-oriented rule emphasizing absence of benchmark priors, cross-scenario balance, and deployable supervisory logic. Strict is used as the primary reporting reference; the supplementary retrospective Grace analysis shows that part of the residual failure set is sensitive to completion semantics rather than gross waypoint-miss behaviour. The evaluation is completed on one Vicon-tracked Tello stack through a two-level hardware study. Taken together, the results suggest that benchmark success becomes more informative when the main controller design is separated from benchmark-specific refinement and remains defensible under harder closed-loop evaluation.
comment: Preprint, 12 pages, 5 figures, 5 tables
Racing a Wheeled Quadruped: Active Load Transfer Mitigation via Model Predictive Control
This paper presents a hierarchical control framework using model predictive control (MPC) and reinforcement learning (RL) for active roll control to manage lateral load transfer during autonomous racing of a wheeled quadruped. The framework integrates offline time-optimal raceline generation, an online MPC planner that actively minimizes the lateral Load Transfer Ratio (LTR), and a low-level, whole-body RL policy deployed directly onto the robot's 16 actuators. The MPC is based on a vehicle dynamics bicycle model of the Unitree Go2-W platform. The robot's leg actuators act as active suspension where knee joints generate anti-roll torque to bank into turns. Physical track experiments demonstrate that active roll control reduces mean LTR by up to 44%, improves the fastest lap time by 8.7%, and boosts peak lateral acceleration capability by 21.3% to 1.98 $m/s^2$, maintaining robust high-speed stability beyond the range of a non-tilting baseline controller. Supplementary code and video can be found at https://github.com/meisman-ucb/go2w-roll-control-mpc
comment: Accepted to the 17th International Symposium on Advanced Vehicle Control (AVEC 2026), 7-11 September 2026, Tsukuba, Japan
NavIsaacLab: Generating Realistic Crowd via Parallel Robot Learning for Benchmarking Human-aware Navigation
Robot autonomous navigation that accounts for surrounding human activities is crucial for ensuring both safety and natural human-robot interaction in real-world environments shared by humans and robots. Simulation of complex and diverse navigation scenarios serves as the foundation for training reliable robot navigation policies and accurately evaluating the performance of algorithms, offering an efficient alternative to manual supervision of real data. However, current human-aware navigation research faces significant challenges due to the scarcity of diverse, high-quality scene data. Existing simulation platforms often rely on handcrafted rules to approximate pedestrian behavior and lack the capability to provide extensive sensor signals, typically assuming perfect observations. To address these limitations, this paper presents NavIsaacLab, a comprehensive framework for benchmarking and training human-aware navigation policies through physics-based and photo-realistic simulations of pedestrians and scenes. Based on Isaac Lab, the proposed framework employs photo-realistic scene rendering capabilities and supports parallel simulation on GPU, delivering real-time and accurate 3D visual feedback to robots. To enhance the realism of human behavior, a data-driven approach is employed that incorporates a trajectory diffusion model and an adversarial motion learning controller, enabling controllable, physics-based pedestrian simulation. Furthermore, the integration of diverse cross-scale scenes provides a robust benchmark for state-of-the-art human-aware navigation methods.
Fast LeWorldModel
Joint-Embedding Predictive Architectures (JEPAs), including recent LeWorldModel (LeWM), have become a promising foundation for reconstruction-free visual world models. For visual planning, however, LeWM evaluates candidate action sequences by repeatedly applying a local one-step latent transition model. This autoregressive rollout makes planning computationally expensive and exposes the predicted trajectory to accumulated latent errors as the horizon grows. We propose Fast LeWorldModel (Fast-LeWM), a fast latent world model that replaces repeated local rollout with action-prefix prediction. Given the current latent and a candidate action sequence, Fast-LeWM encodes its prefixes and predicts the future latents reached after executing those prefixes in parallel. By making action prefixes the basic prediction unit, Fast-LeWM directly models action effects accumulated to different extents over multiple horizons. This prefix-level supervision forces the model to learn how states continuously evolve under different action prefixes, rather than only fitting one-step state transitions. During planning, the predictor can use the last prefix token from the encoded action sequence to evaluate the corresponding future latent without explicitly rolling through each intermediate imagined state. Across multiple tasks, Fast-LeWM improves average success over LeWM while substantially reducing planning time, achieving lower open-loop latent loss whose growth becomes significantly slower as the rollout horizon increases.
TaskNPoint: How to Teach Your Humanoid to Hit a Backhand in Minutes
How do we learn to hit a tennis backhand? Not from a thousand hours of tennis tournaments on TV - we work with a coach and practice. We argue this is also the right recipe for teaching dynamic skills to humanoid robots. This follows from a structural property of dynamic skills: the outcome is decided by a short, crucial portion of the trajectory - for a backhand, the ~20cm of racket travel around ball contact. Getting this interaction window right requires coordinating the whole motion, so that control, physics, and morphology act in concert. Learning thus reduces to mastering a handful of distinct actions and, for each, practicing until the window comes out right. To this end, we introduce TaskNPoint, a training protocol which makes the coach-learner division of labor explicit. The human coach contributes four inputs: a discrete set of skills (e.g. different shots), one demonstration per skill, identification of the interaction window, and the goal. Learning in a physically realistic simulation environment fills in each action trajectory and provides robustness to unmodeled events. Crucially, randomized target sampling during training lets a single demonstration generalize zero-shot to unseen goal locations. We test this approach on a Unitree G1 humanoid that hits forehands and backhands against balls thrown by a human, kicks incoming soccer balls, and picks and places boxes from novel locations. We find that learning is successful from short human video demonstrations and under an hour of training on a single GPU, with no per-task reward tuning.
RoboTales: ROBOTic Anthropomorphic LEarning Systems
RoboTales is a low-cost robotic storytelling system that animates narratives using expressive sock puppetry. Implemented autonomously on a Baxter robot as a test case, RoboTales synchronizes narration, gestures, and mouth movements to perform character-driven stories. In a pilot study, puppet-based storytelling outperformed a gesture-only mode, producing higher HRIES ratings and improved story recall, suggesting that embodied puppetry enhances engagement and narrative comprehension. Designed to be modular and platform-agnostic, RoboTales can be adapted to other manipulators and offers a screen-free alternative to passive media, supporting future deployment in child-centered learning environments.
comment: 4 pages, 4 figures, HRI Companion '26: Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction, Student Design Challenge
OmniContact: Chaining Meta-Skills via Contact Flow for Generalizable Humanoid Loco-Manipulation
Learning long-horizon humanoid loco-manipulation poses a dual challenge: it requires not only the robust execution of meta-skills but also their seamless, closed-loop chaining equipped with autonomous recovery. Existing approaches remain limited: explicit humanoid-object interaction representations offer precision but are notoriously difficult for high-level planning, whereas implicit skill embeddings are compact but lack the interpretability required for reliable composition. We propose \ours, a hierarchical framework centered on \textbf{contact flow (CF)}, a compact representation consisting of key body trajectories and time-series binary contact signals. Leveraging this shared interface, our low-level policy \textbf{CF-Track} learns a unified library of loco-manipulation skills, while our high-level module \textbf{CF-Gen} heuristically synthesizes future contact-flow sequences. To support this setting, we additionally collect the OmniContact dataset, a MoCap-based HOI corpus for humanoid loco-manipulation (Appendix~\ref{sec:dataset}). Together, they enable robust execution, autonomous failure recovery, and flexible composition of meta-skills for long-horizon tasks. Experiments show that OmniContact achieves \(98.7\%\) success on \textit{Carry Box} and \(76.5\%\) on \textit{Push-Stack Boxes}, outperforming prior baselines by average margins of \(40.9\%\) in meta-skill and \(66.5\%\) in skill chaining. Besides, our framework naturally integrates with VLMs for semantic task decomposition, enabling complex, semantically grounded loco-manipulation behaviors, such as arranging scattered boxes into a heart shape.
Morphology-Specific Closed-Loop Control of Logarithmic-Spiral Continuum Arms via Online Jacobian Error Compensation
Logarithmic spirals are ubiquitous in biological appendages and provide an attractive morphology for continuum manipulators capable of reaching, wrapping, and grasping. Recently reported logarithmic-spiral robots demonstrated scalable fabrication and versatile grasping but lacked inverse kinematics and closed-loop control. This work presents the first morphology-specific closed-loop task-space control framework for logarithmic-spiral continuum arms. A segmented tendon-driven model with a centerline backbone and equilateral tendon routing is developed in MuJoCo to capture tapered compliance and contact dynamics. An analytical task-space Jacobian is derived directly from the logarithmic-spiral kinematics and combined with online Jacobian error compensation using a Broyden secant update and Kalman-filter estimation. The resulting controller continuously corrects modeling errors arising from nonlinear deformation, contact, and geometric mismatch. The framework is validated through planar and spatial simulations, including trajectory tracking, attitude regulation, disturbance rejection, three-dimensional position tracking, and simultaneous position-orientation control. Compared with a piecewise-constant-curvature (PCC) baseline, the proposed method consistently reduces tracking errors, suppresses attitude drift, and maintains a bounded Jacobian estimation error. The controller is further applied to morphology-enabled manipulation tasks, including obstacle-assisted reach-wrap-release motions, adaptive whole-arm grasping, and cooperative multi-arm object handling. Results demonstrate that combining logarithmic-spiral morphology with online Jacobian compensation enables accurate, robust, and scalable control of highly underactuated continuum manipulators. The proposed framework establishes a physics-grounded baseline for future hardware implementation and learning-augmented soft robotic control.
LiMoDE: Rethinking Lifelong Robot Manipulation from a Mixture-of-Dynamic-Experts Perspective
Building a generalist robot that can leverage prior knowledge for continuous task adaptation remains a significant challenge. Previous works alleviate the catastrophic forgetting problem by parameter-efficient fine-tuning for single-task adaptation. However, they fail to extract reusable skills and model the interaction with other skills effectively. Recent works try to address these issues by learning prompts. Differently, this paper presents an architectural perspective on the Lifelong Mixture of Dynamic Experts (\textit{LiMoDE}), a novel two-stage learning scheme for lifelong robot manipulation. Specifically, a dynamic MoE structure is first proposed in the multi-task pre-training stage to learn prior knowledge, where a varied number of heterogeneous experts are activated based on the motion information to address different short-term manipulations. Subsequently, in the task adaptation stage, we design a lifelong MoE adaptation mechanism % (LiMoEAM) that learns lifelong experts and dynamically combines them with frozen ones for new tasks, facilitating the knowledge transfer during adaptation. The proposed \textit{LiMoDE} is evaluated on both the simulated lifelong learning benchmark and real-world tasks. Extensive experiments demonstrate its effectiveness in achieving superior performance and strong lifelong adaptation by introducing a moderate number of additional trainable parameters and inference overhead.
RMTL: Reinforced Micro-task Learning for Long-Horizon Manipulation with VLM Rewards
Reinforcement learning (RL) for robotic manipulation often requires manually designing a dense reward function, which is difficult to tune and often fragile, or learning a reward from human demonstrations or preferences, which can be expensive. A recent line of work uses pretrained vision-language models (VLMs) as zero-shot reward models, replacing these costs with a single text prompt. However, we argue that a single global prompt is too coarse for long-horizon manipulation tasks with randomized initial conditions. The single-prompt VLM reward is near-flat for much of the trajectory, making early progress hard for the agent to detect. We propose Reinforced Micro-Task Learning (RMTL), an approach that decomposes a manipulation task into a small set of language-described micro-tasks and trains the agent to switch between them. At each step, the agent receives a multi-view VLM reward computed using the prompt of the currently active micro-task and averaged across multiple camera views to reduce the effect of view-specific occlusions. A reverse curriculum gradually exposes the agent to harder initial conditions, while a PPO worker is first trained with a fixed distance-based rule that selects the active micro-task. We then replace this rule with a learned hierarchical manager, turning rule-based phase selection into a fully learned hierarchical policy. We instantiate RMTL on the Fetch manipulation environment using three short stage-specific prompts and without additional prompt tuning. Experiments show that RMTL provides more informative reward signals than single-prompt VLM rewards, enabling faster learning. These results suggest that decomposing VLM rewards into micro-task-specific language prompts can substantially improve the scalability of language-guided reinforcement learning for robotic manipulation.
comment: 16 pages, 11 figures
Beyond Topology: A Morphological Symmetry Graph Representation for Locomotion Policy Learning
Reinforcement learning has enabled impressive locomotion skills on articulated robots, but common policy representations remain only weakly aligned with robot physics. Generic networks ignore kinematic structure, while graph-based policies encode connectivity without specifying how physical quantities transform across symmetric body parts. We introduce a morphological symmetry graph representation for locomotion policy learning and instantiate it in MS-PPO. Starting from the robot's topological graph, our representation augments each observation and action space with the permutation and sign transformations induced by morphological symmetry. This yields a symmetry-equivariant graph actor and a symmetry-invariant graph critic, enforcing the desired policy and value constraints by construction rather than through reward shaping or data augmentation. We evaluate MS-PPO on a variety of locomotion tasks using both Unitree Go2 quadruped and Unitree G1 humanoid, including command tracking, asymmetric joint failures, out-of-distribution command generalization, and zero-shot sim-to-real deployment. Experiments show improved symmetry generalization, robustness, sample efficiency, and model efficiency over topology- and symmetry-aware baselines.
comment: Project page: https://msppo.github.io/
RoboRouter: Training-Free Policy Routing for Robotic Manipulation
Research on robotic manipulation has developed a diverse set of policy paradigms, including vision-language-action (VLA) models, vision-action (VA) policies, and code-based compositional approaches. Concrete policies typically attain high success rates on specific task distributions, but limited generalization beyond it. Rather than proposing another monolithic policy, we propose to leverage the complementary strengths of existing approaches through intelligent policy routing. We introduce RoboRouter, a training-free framework that maintains a pool of heterogeneous policies and learns to select the best-performing policy for each task through accumulated execution experience. Given a new task, RoboRouter constructs a semantic task representation, retrieves historical records of similar tasks, predicts the optimal policy choice without requiring trial-and-error, and incorporates structured feedback to refine subsequent routing decisions. Integrating a new policy into the system requires only a lightweight evaluation and does not incur training overhead. Across simulation benchmark and real-world evaluations, RoboRouter consistently outperforms individual policies, improving the average success rate by more than 3% in simulation and 13% in real-world settings, while preserving execution efficiency. Our results demonstrate that intelligent routing across heterogeneous, off-the-shelf policies provides a practical and scalable pathway toward building more capable robotic systems.
$ω$-EVA: Envision, Verify, and Act with Latent Interactive World Models
Embodied policies typically map current observations directly to actions, leaving candidate-action consequences implicit. World models provide predictive supervision, representations, or external simulation, but rarely let a policy inspect the imagined consequence of its own proposal before acting. We introduce $ω$-EVA, a latent interactive world model that realizes an Envision--Verify--Act loop for embodied action generation. Its three-stage framework learns action-conditioned latent dynamics, trains a language-conditioned flow policy on dynamics-aware visual representations, and feeds the policy's proposal back through the world model. A tri-branch refiner jointly reasons over the current state, proposal-conditioned future, and proposed action to produce the final action chunk. Because consequence reasoning remains in latent feature space, $ω$-EVA avoids generating future videos at inference. Evaluations across diverse single-arm, bimanual, long-horizon, and perturbed simulation settings show that the complete interaction pipeline consistently improves the proposal policy, while latent diagnostics indicate meaningful action-conditioned future structure. With approximately 1.2B parameters and no additional robot-data pretraining, $ω$-EVA demonstrates a compact and competitive performance--scale--data trade-off, making the world model an active action-feedback module rather than a passive predictor.
comment: Add some ablation experiments
Engineering Reliable Autonomous Systems: Challenges and Solutions
Engineering reliable autonomous systems is an important and growing topic in computer science. As autonomous systems become more prevalent, easy-to-use techniques for building them reliably are increasingly important. This workshop report captures and expands on the discussions at the Lorentz Center Workshop "Engineering Reliable Autonomous Systems" (ERAS), held from 10 to 14 June 2024. The workshop was co-organised by the organisers of the Workshop on Formal Methods for Autonomous Systems (FMAS) and the Workshop on Agents and Robots for reliable Engineered Autonomy (AREA). It brought together members of the FMAS and AREA communities, industry practitioners, and representatives from sectors where autonomous systems pose distinctive engineering challenges. The workshop focused on three main research topics: techniques for verification and validation of autonomous systems; engineering real-world autonomous systems; and software architectures for safe autonomous systems. Its main outcome is a catalogue of challenges in these areas and, most importantly, a pathway to solutions. Some challenges can already be tackled by techniques that are well known in academia but have not yet become regularly used in practice. Other challenges remain unresolved and require further research. This roadmap is intended to support future research and industrial collaboration.
RoDyn: Taming Interactive Robot-Dynamic 2.5D World Model for Robotic Manipulation
Learned world models hold significant potential as neural simulators for robotic manipulation. However, prevalent 2D video-based models inherently lack the spatial and kinematic reasoning crucial for physical interactions. We introduce RoDyn, a novel Robot-Dynamic 2.5D World Model that formulates environmental dynamics within a highly efficient, geometry-aware latent space. Through the proposed Robot-Dynamic Tokenizer, we explicitly couple semantic visual appearances with spatial and agent-centric priors via an RGB-dominated cross-attention mechanism and dynamic mask guidance. Furthermore, by injecting these mask priors directly into sequence transitions, our Mask-guided Autoregressive architecture drives the model to focus on active robot-object interaction regions. Extensive experiments demonstrate that RoDyn establishes SOTA generation fidelity across large-scale datasets. Crucially, it translates these predictive capabilities into substantial downstream gains, accelerating model-based reinforcement learning and achieving a 42\% improvement in real-world imitation learning success rates over pure 2D baselines.
Spotlighting Task-Relevant Features: Object-Centric Representations for Better Generalization in Robotic Manipulation
The generalization capabilities of robotic manipulation policies are heavily influenced by the choice of visual representations. Existing approaches typically rely on representations extracted from pre-trained encoders, using two dominant types of features: global features, which summarize an entire image via a single pooled vector, and dense features, which preserve a patch-wise embedding from the final encoder layer. While widely used, both feature types mix task-relevant and irrelevant information, leading to poor generalization under distribution shifts, such as changes in lighting, textures, or the presence of distractors. In this work, we explore an intermediate structured alternative: Slot-Based Object-Centric Representations (SBOCR), which group dense features into a finite set of object-like entities. This representation permits to naturally reduce the noise provided to the robotic manipulation policy while keeping enough information to efficiently perform the task. We benchmark a range of global and dense representations against intermediate slot-based representations, across a suite of simulated and real-world manipulation tasks ranging from simple to complex. We evaluate their generalization under diverse visual conditions, including changes in lighting, texture, and the presence of distractors. Our findings reveal that SBOCR-based policies outperform dense and global representation-based policies in generalization settings, even without task-specific pretraining. These insights suggest that SBOCR is a promising direction for designing visual systems that generalize effectively in dynamic, real-world robotic environments.
TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control
Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency, limiting them to low-frequency batch-and-execute paradigm. This frequency mismatch creates an execution blind spot, causing failures in dynamic environments where targets move during the open-loop execution window. We propose TIDAL (Temporally Interleaved Diffusion and Action Loop), a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture to redistribute the computational budget. Specifically, a low-frequency macro-intent loop caches semantic embeddings, while a high-frequency micro-control loop interleaves single-step flow integration with execution. This design enables approximately 9 Hz control updates on edge hardware (vs. approximately 2.4 Hz baselines) without increasing marginal overhead. To handle the resulting latency shift, we introduce a temporally misaligned training strategy where the policy learns predictive compensation using stale semantic intent alongside real-time proprioception. Additionally, we address the insensitivity of static vision encoders to velocity by incorporating a differential motion predictor. TIDAL is architectural, making it orthogonal to system-level optimizations. Experiments show a 2x performance gain over open-loop baselines in dynamic interception tasks. Despite a marginal regression in static success rates, our approach yields a 4x increase in feedback frequency and extends the effective horizon of semantic embeddings beyond the native action chunk size. Under non-paused inference protocols, TIDAL remains robust where standard baselines fail due to latency.
PhyGile: Physics-Prefix Guided Motion Generation for Agile General Humanoid Motion Tracking
Humanoid robots are expected to execute agile and expressive whole-body motions in real-world settings. Existing text-to-motion generation models are predominantly trained on captured human motion datasets, whose priors assume human biomechanics, actuation, mass distribution, and contact strategies. When such motions are directly retargeted to humanoid robots, the resulting trajectories may satisfy geometric constraints (e.g., joint limits and pose continuity) and appear kinematically reasonable. However, they frequently violate the physical feasibility required for real-world execution. To address these issues, we present PhyGile, a unified framework that closes the loop between robot-native motion generation and General Motion Tracking (GMT). PhyGile performs physics-prefix-guided robot-native motion generation at inference time, directly generating robot-native motions in a 262-dimensional skeletal space with physics-guided prefixes, thereby eliminating inference-time retargeting artifacts and reducing generation-execution discrepancies. Before physics-prefix adaptation, we train the GMT controller with a curriculum-based mixture-of-experts scheme, followed by post-training on unlabeled motion data to improve robustness over large-scale robot motions. During physics-prefix adaptation, the GMT controller is further fine-tuned with generated objectives under physics-derived prefixes, enabling agile and stable execution of complex motions on real robots. Extensive offline and real-robot experiments demonstrate that PhyGile expands the frontier of text-driven humanoid control, enabling stable tracking of agile, highly difficult whole-body motions that go well beyond walking and low-dynamic motions typically achieved by prior methods.
RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation
Reinforcement learning for robot manipulation is often bottlenecked by reward design, especially in long-horizon tasks: sparse success rewards provide weak supervision, while hand-crafted dense rewards are tedious to design and generalize poorly across tasks. Progress-based reward models offer a promising alternative by estimating how far an observation has advanced toward task completion, but existing approaches often require task-specific demonstrations or progress labels, and can assign high rewards to visually plausible but physically incorrect states. We introduce the Reference-Anchored Reward Model (RARM), a lightweight visual comparator that converts a single successful demonstration into a dense, progress-aware reward. RARM is trained once on general-purpose videos with a contrastive temporal objective, requiring no robot-specific data, task-specific reward labels, or per-task reward engineering. At deployment, RARM matches rollout clips to reference clips and rewards only confident forward progress, suppressing uncertain matches that may otherwise produce false-positive rewards. Across 9 simulated manipulation tasks from LIBERO and MetaWorld and 4 real-world tasks, RARM achieves the best overall success rates in subsequent RL training, with particularly large gains on long-horizon tasks such as cloth folding, where unreliable progress estimates are especially harmful.
ZeroWBC: Learning Natural Whole-Body Humanoid Interaction from Human Egocentric Data
Achieving versatile and natural whole-body humanoid interaction control remains challenging due to the high cost of whole-body teleoperation data. We present ZeroWBC, a teleoperation-free framework that learns humanoid whole-body interaction from human egocentric videos paired with synchronized whole-body motion and text annotations. ZeroWBC adopts a generation-then-tracking formulation to tackle the static scene whole-body interaction control problem. Given an initial egocentric image and a language instruction, a fine-tuned Vision-Language Model generates future human whole-body motion tokens, which are decoded into continuous motions and retargeted to the humanoid. The resulting reference motions, together with root and key body-part trajectories, are then executed by a general interactive motion tracking policy. To improve interaction performance, we introduce an interaction-oriented tracking reward that prioritizes global root and key body-part trajectory alignment while preserving natural whole-body motion. Experiments on the Unitree G1 humanoid robot show that ZeroWBC enables diverse scene-aware behaviors without robot teleoperation demonstrations. These results suggest a scalable paradigm for learning natural humanoid whole-body interaction from human egocentric data.
Incremental Residual Reinforcement Learning Toward Real-World Learning for Social Navigation
As the demand for mobile robots continues to increase, social navigation has emerged as a critical task, driving active research into deep reinforcement learning (RL) approaches. However, because pedestrian dynamics and social conventions vary widely across different regions, simulations cannot easily encompass all possible real-world scenarios. Real-world RL, in which agents learn while operating directly in physical environments, presents a promising solution to this issue. Nevertheless, this approach faces significant challenges, particularly regarding constrained computational resources on edge devices and learning efficiency. In this study, we propose incremental residual RL (IRRL). This method integrates incremental learning, which is a lightweight process that operates without a replay buffer or batch updates, with residual RL, which enhances learning efficiency by training only on the residuals relative to a base policy. Through the simulation experiments, we demonstrated that, despite lacking a replay buffer, IRRL achieved performance comparable to those of conventional replay buffer-based methods and outperformed existing incremental learning approaches. Furthermore, the real-world experiments confirmed that IRRL can enable robots to effectively adapt to previously unseen environments through the real-world learning.
comment: RO-MAN 2026, video https://youtu.be/NJOhv56CrPM
GeoFlow-SLAM++: A Robust Multi-Camera Visual-Inertial SLAM System with Relocalization
Monocular and RGB-D visual-inertial SLAM systems remain susceptible to limited field of view, sensor-specific failure modes, and unreliable cross-session relocalization. To address these issues, we present GeoFlow-SLAM++, a tightly coupled multi-camera visual-inertial SLAM system that extends GeoFlow-SLAM from a single RGB-D sensor to a calibrated multi-camera rig with a unified body-centric formulation. Within this multi-camera framework, GeoFlow-SLAM++ supports two interchangeable visual front-ends: a conventional ORB front-end and a neural network feature (NN-Feature) front-end built on SuperPoint and LightGlue. The system unifies tracking, mapping, and relocalization on a shared body state, and combines multi-camera reprojection constraints, IMU pre-integration, cross-view place recognition, and dual-stream optical flow/NN-Feature tracking for robust localization. As an optional extension, the system can further incorporate cross-view-consistent pseudo-depth predictions from RGB images as auxiliary geometric constraints. We evaluate GeoFlow-SLAM++ on EuRoC, OpenLORIS, TUM, Hilti, and a self-collected handheld multi-camera dataset. Results show that the NN-Feature front-end improves robustness in appearance-challenging scenarios, the multi-camera formulation achieves competitive localization accuracy on Hilti, and the unified cross-view relocalization design reaches LiDAR-comparable performance on the handheld dataset.
comment: 10 pages
ReaDy-Go: Real-to-Sim Dynamic 3D Gaussian Splatting Simulation for Environment-Specific Visual Navigation with Moving Obstacles
Visual navigation models often struggle in real-world dynamic environments due to limited robustness to the sim-to-real gap and the difficulty of training policies tailored to target deployment environments (e.g., households, restaurants, and factories). Although real-to-sim navigation simulation using 3D Gaussian Splatting (GS) can mitigate these challenges, prior GS-based works have considered only static scenes or non-photorealistic human obstacles built from simulator assets, despite the importance of safe navigation in dynamic environments. To address these issues, we propose ReaDy-Go, a novel real-to-sim simulation pipeline that synthesizes photorealistic dynamic scenarios in target environments by augmenting a reconstructed static GS scene with dynamic human GS obstacles, and trains navigation policies using the generated datasets. The pipeline provides three key contributions: (1) a dynamic GS simulator that integrates static scene GS with a human animation module, enabling the insertion of animatable human GS avatars and the synthesis of plausible human motions from 2D trajectories, (2) a navigation dataset generation framework that leverages the simulator along with a robot expert planner designed for dynamic GS representations and a human planner, and (3) robust navigation policies to both the sim-to-real gap and moving obstacles. The proposed simulator generates thousands of photorealistic navigation scenarios with animatable human GS avatars from arbitrary viewpoints. ReaDy-Go outperforms baselines across target environments in both simulation and real-world experiments, demonstrating improved navigation performance even after sim-to-real transfer and in the presence of moving obstacles. Moreover, zero-shot sim-to-real deployment in an unseen environment indicates its generalization potential. Project page: https://syeon-yoo.github.io/ready-go-site/.
comment: Accepted by IEEE Robotics and Automation Letters (RA-L). Project page: https://syeon-yoo.github.io/ready-go-site/
FAST-LIVGO: A Degeneracy-Robust LiDAR-Inertial-Visual-GNSS Fusion Odometry IROS 2026
Robust state estimation and mapping in long-term, large-scale, and highly dynamic environments remains a key challenge in robotics. Existing LiDAR-Inertial-Visual Odometry (LIVO) systems achieve strong local accuracy but suffer from accumulated drift over long distances and may fail in geometrically degraded or textureless scenes. Meanwhile, GNSS-aided fusion frameworks often rely on LiDAR or visual odometry for state prediction and outlier rejection, making them vulnerable when odometry degenerates. To address these limitations, we propose a tightly coupled LiDAR-Inertial-Visual-GNSS fusion framework based on an Error-State Iterated Kalman Filter. An online spatiotemporal alignment module using Dynamic Time Warping is introduced for highly dynamic conditions. To better exploit GNSS precision, we develop observation models based on Doppler shifts and fixed-anchor Time-Differenced Carrier Phase, providing millimeter-level relative constraints without augmenting historical anchor states. We further design a degeneracy-aware dual-mode outlier rejection strategy that switches between LIVO-prior-guided rejection and GNSS-aided recovery according to the LIVO degeneracy level. Experiments on the public M3DGR dataset and a custom 20~m/s fixed-wing UAV dataset demonstrate that our system reduces accumulated drift and map ghosting, outperforming state-of-the-art methods in accuracy and robustness.
comment: Accepted for presentation at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
Uncertainty-Aware Ankle Exoskeleton Control
Lower limb exoskeletons show promise to assist human movement, but their utility is limited by controllers designed for discrete, predefined actions in controlled environments, restricting their real-world applicability. We present an uncertainty-aware control framework that enables ankle exoskeletons to operate safely across diverse scenarios by automatically disengaging when encountering unfamiliar movements. Our approach uses an uncertainty estimator to classify movements as similar (in-distribution) or different (out-of-distribution) relative to actions in the training set. We evaluated three architectures (model ensembles, autoencoders, and generative adversarial networks) on an offline dataset and tested the strongest performing architecture (ensemble of gait phase estimators) online. The online test demonstrated the ability of our uncertainty estimator to turn assistance on and off as the user transitioned between in-distribution and out-of-distribution tasks (F1: 89.2). This new framework provides a path for exoskeletons to safely and autonomously support human movement in unstructured, everyday environments.
Residual RL-MPC for Robust Microrobotic Cell Pushing Under Time-Varying Flow IROS 2026
Contact-rich micromanipulation in microfluidic flow is challenging because small disturbances can break pushing contact and induce large lateral drift. We study planar cell pushing with a magnetic rolling microrobot that tracks a waypoint-sampled reference curve under time-varying Poiseuille flow in simulation. We propose a hybrid controller that augments a nominal MPC with a learned residual policy trained by SAC. The policy outputs a bounded 2D velocity correction that is contact-gated, so residual actions are applied only during robot-cell contact, preserving reliable approach behavior and stabilizing learning. All methods share the same actuation interface and speed envelope for fair comparisons. Simulation results show improved robustness and tracking accuracy over pure MPC and PID under nonstationary flow, with generalization from a clover training curve to unseen circle and square trajectories. A residual-bound sweep identifies an intermediate correction limit as the best trade-off, which we use in all benchmarks.
comment: 8 pages, 8 figures. Accepted by IEEE/RSJ IROS 2026. Revised manuscript with reviewer feedback
Visual-Language-Guided Task Planning for Horticultural Robots
Crop monitoring is essential for precision agriculture, but current systems lack high-level reasoning. We introduce a novel, modular framework that uses a Vision Language Model (VLM) to guide robotic task planning by actively querying heterogeneous data sources, including enriched RGB camera feeds and 2D semantic occupancy maps, interleaved with robotic action primitives. We contribute a comprehensive benchmark for short- and long-horizon crop monitoring tasks in monoculture and polyculture environments. Our results show that while zero-shot VLMs perform robustly for short-horizon tasks (achieving 87% success, comparable to human experts), success drops significantly to under 10% for complex long-horizon, multi-target tasks. Despite this decline, task completion rates remain above 76% under noiseless conditions. Critically, the system degrades when relying on noisy semantic maps, demonstrating a key limitation in current VLM context grounding for sustained robotic operations. This work offers a deployable framework and critical insights into VLM capabilities and shortcomings for complex agricultural robotics.
comment: 18 pages, 6 figures
Multiagent Systems
Can Trustless Agents Be Trusted? An Empirical Study of the ERC-8004 Decentralized AI Agent Ecosystem
As autonomous AI agents increasingly transact across organizational boundaries, a fundamental trust challenge emerges: how can an agent assess whether an unknown counterpart is trustworthy? The ERC-8004 protocol addresses this challenge with the first permissionless trust layer for AI agent economies, built around three on-chain registries for Identity, Reputation, and Validation. Despite its rapid adoption, the protocol has not been studied empirically, leaving it unclear whether the information it records provides a trustworthy basis for decision-making. To address this gap, we present the first empirical study of ERC-8004 across three chains: Ethereum, BNB Smart Chain (BSC), and Base, covering the period from protocol deployment through May 13, 2026. We crawl on-chain Identity and Reputation events, off-chain files, and x402 payment transactions. On the identity side, we find that most registrations are placeholders rather than active agents, with only a small fraction (3%, 4%, and 15% across Ethereum, BSC, and Base) exposing a valid ERC-8004 registration file with at least one live service endpoint. On the reputation side, we show that the Registry, as currently deployed, cannot function as a trust signal: values are not commensurable, feedback records are rarely grounded in verifiable interactions, and reputation can be manipulated at minimal cost. Consistent with these design weaknesses, we find that a substantial fraction of reviewers (73.6%, 59.2%, and 90.6% across Ethereum, BSC, and Base) exhibit coordinated Sybil behavior. After removing Sybil-flagged feedback, 15.5%, 72.3%, and 89.4% of rated agents, respectively, are left with no valid feedback. We then turn these findings into concrete recommendations for future revisions of ERC-8004. Our study yields actionable protocol-design implications and establishes an empirical baseline for research on AI agent markets.
Variable Bound Tightening for Nash Equilibrium Computation in Multiplayer Imperfect-Information Games
There has been significant recent progress in algorithms for approximation of Nash equilibrium in large two-player zero-sum imperfect-information games and exact computation of Nash equilibrium in multiplayer strategic-form games. While counterfactual regret minimization and fictitious play are scalable to large games and have convergence guarantees in two-player zero-sum games, they do not guarantee convergence to Nash equilibrium in multiplayer games. Recently, an approach has been presented for exact computation of Nash equilibrium in multiplayer imperfect-information games that solves a quadratically constrained program based on a nonlinear complementarity problem formulation derived from the sequence-form game representation. This formulation was solved using Gurobi's nonconvex quadratic solver, which employs spatial branch-and-bound to iteratively refine variable bounds by solving convex relaxations of bilinear terms via McCormick envelopes. During presolve, Gurobi introduces auxiliary variables and, in some cases, binary variables, leading to an internal MIQCP reformulation. This approach was demonstrated to outperform prior algorithms from the Gambit software suite and quickly solve three-player Kuhn poker after removal of dominated actions; however, the algorithm was not able to solve the full version of the game within 24 hours. In this paper, we derive finite bounds on slack and multiplier variables in the nonlinear complementarity formulation. These bounds strengthen the convex relaxations used within spatial branch-and-bound and lead to substantial computational improvements. We demonstrate the impact of the proposed bounds on exact Nash equilibrium computation in three-player Kuhn poker.
Multi-Agent Goal Recognition with Team- and Goal-Conditioned Reinforcement Learning and Factorized Branch-and-Bound
Multi-agent goal recognition asks an observer to jointly infer which agents act together and what each team is trying to achieve, so the hypothesis space grows combinatorially with the number of team partitions and goals per team. Real applications such as drone surveillance and collaborative robotics expose only the agents' trajectory, which forces the observer to rank team-goal hypotheses from behavior alone. Multi-Agent Goal Recognition with Branch-and-Bound (MAGR-BB) addresses this setting with a shared team- and goal-conditioned policy used as the scoring model inside a factorized branch-and-bound search. On a controlled multi-agent Blocksworld benchmark, MAGR-BB returns the same top-ranked hypothesis as exhaustive search throughout the trajectory while cutting hypothesis materialization by orders of magnitude and reducing cumulative recognition runtime substantially.
comment: 12 pages, 1 figure, 2 tables
Manipulation Is Task-Dependent: A Multi-Axis, Multi-Environment Evaluation of Frontier LLMs
We evaluate manipulative behavior in six frontier language models across six environments, ranging from negotiation tasks to agentic workflows, resulting in 13{,}590 individual scenarios. Manipulation rates are measured across three axes: framing (mandate honesty or permit manipulation), incentive structure (from no incentives to substantial ones), and task difficulty. Existing benchmarks typically vary a single axis within a single environment, an approach our results show is insufficient. We rank models by manipulation rate and find Spearman rank correlations across environments average $ρ= 0.055$, indicating manipulative tendencies in one task do not necessarily predict those in another. Additionally, we find the axis that drives manipulation varies across different environments. In environments where models are incentivized to misrepresent future actions, instructional framing and structurally binding incentives are the primary drivers; in environments where models are incentivized to misrepresent a ground truth, task difficulty dominates. This split was identified in five environments and validated against a sixth held-out environment. Together, these findings illustrate the importance of rigorous multi-dimensional evaluations when measuring manipulative propensities.
Robustness and Leadership in Markov-switching Consensus Networks
We investigate how time-varying interactions, modeled via a Markov switching graph (MSG), impact the robustness of noisy multi-agent dynamics in both continuous- and discrete-time settings. Our focus is on the steady-state performance of consensus and leader-follower tracking dynamics subject to stochastic noise. Using the framework of Markov jump linear systems (MJLS), we derive expressions for the steady-state covariance of each agent's deviation from consensus and tracking error, respectively, and use them to quantify individual and group performance as a function of the interaction graphs and the switching dynamics. We extend established notions of robustness, certainty indices, and joint centrality from static graphs to the MSG setting. To gain analytical insight, we specialize our results to systems switching between two topologies and characterize how switching influences performance. Numerical simulations further illustrate how switching topologies affects system robustness in both coordination tasks.
comment: An extended version of an earlier IEEE CDC paper
Agentic evolution of physically constrained foundation models
Artificial intelligence increasingly drives automated scientific discovery, yet contemporary generalist agents lack physical grounding, frequently hallucinating hardware-incompatible designs. Here, we present a physically grounded, multi-agent discovery engine that autonomously architects hardware-compliant computing systems. Anchored by an Evolutionary Knowledge Graph structuring past scientific innovations, the framework extracts an "algorithmic Chain-of-Thought" to transform blind stochastic search into directed structural evolution. Applied to the extreme testbed of foundation model deployment, the engine evolved two hardware-aware compression methodologies surpassing human-engineered heuristics: Q-Enhance mitigates long-context accuracy loss in dense models, and MoE-Salient-AQ outperforms state-of-the-art manual sparse Mixture-of-Experts designs by 3.7% at sub-3-bit regimes. Utilizing a bandwidth-efficient Sensitivity Profile, we successfully deployed a massive 235-billion-parameter model onto a constrained dual-A100 server, reducing memory requirements by 75% with a marginal 0.64% accuracy degradation. By transforming unconstrained combinatorial search into knowledge-driven autonomy, this establishes a scalable hardware-software co-design paradigm for machine-driven discovery within strict physical boundaries.
comment: 29 pages, 5 main figures and 4 extended data figures
Low Variance Trust Region Optimization with Independent Actors and Sequential Updates in Cooperative Multi-agent Reinforcement Learning
Cooperative multi-agent reinforcement learning assumes each agent shares the same reward function and can be trained effectively using the Trust Region framework of single-agent. Instead of relying on other agents' actions, the independent actors setting considers each agent to act based only on its local information, thus having more flexible applications. However, in the sequential update framework, it is required to re-estimate the joint advantage function after each individual agent's policy step. Despite the practical success of importance sampling, the updated advantage function suffers from exponentially high variance problems, which likely result in unstable convergence. In this work, we first analyze the high variance advantage both empirically and theoretically. To overcome this limitation, we introduce a clipping objective to control the upper bounds of the advantage fluctuation in sequential updates. With the proposed objective, we provide a monotonic bound with sub-linear convergence to $ε$-Nash Equilibria. We further derive two new practical algorithms using our clipping objective. The experiment results on three popular multi-agent reinforcement learning benchmarks show that our proposed method outperforms the tested baselines in most environments. By carefully analyzing different training settings, our proposed method is highlighted with both stable convergence properties and the desired low advantage variance estimation. For reproducibility purposes, our source code is publicly available at https://github.com/giangbang/Low-Variance-Trust-Region-MARL.
Rate-Aware Quantum-Inspired Trajectory Learning for Interference-Limited Multi-UAV Networks
Unmanned aerial vehicle (UAV) can provide on-demand, high-capacity connectivity in disaster and normal situation. However, it faces a challenge of curse of dimensionality in trajectory optimization, where interference-limited environments and vast search spaces make real-time coordination computationally expensive. To overcome this challenge, we propose the Rate-Aware Quantum-Annealed Graph Condensation (RA-QAGC) scheme, which combines rate-aware graph abstraction with decentralized reinforcement learning to enable scalable, interference-aware UAV coordination. By identifying high throughput locations and guiding UAV trajectory adaptation toward throughput-optimal regions, RA-QAGC effectively balances network capacity by maintaining quality-of-service (QoS) requirements. Simulation results demonstrate the proposal outperformed over existing schemes by achieving 59.4 Mbps total throughput and 23.9 Mbps priority-user throughput, representing gains of approximately 15% and 34%, respectively, over the baseline schemes.
Agentic Knowledge Tracing: A Multi-Agent LLM Architecture for Stealth Assessment of Financial Literacy in Serious Games
Assessing financial literacy during gameplay without disrupting the learning experience remains a key challenge in serious games for education. We present the Agentic BKT pipeline, a multi-agent large language model architecture for stealth assessment of financial competencies from open-ended gameplay events. The pipeline processes events from a 2D platformer serious game aligned with the OECD/INFE financial literacy framework through four phases: (1) the game captures every player decision as a structured event log; (2) an LLM event classifier labels each action on a four-point rubric validated against three domain experts (Fleiss kappa = 0.624, substantial agreement); (3) four domain-specific agents specializing in risk mitigation, investing, spending, and credit management perform session-level reasoning over behavioral trajectories, feeding per-competency Bayesian Knowledge Tracing that estimates mastery within each domain; and (4) an expert judge agent synthesizes the domain-level estimates into an overall mastery score. Evaluated with 193 K-12 participants across 264 game sessions, the Agentic BKT pipeline yields mastery estimates significantly correlated with learning gain (r = 0.276, p = 0.0001) and post-test scores (r = 0.333, p < 0.0001) while showing no correlation with pre-test scores, providing both convergent and discriminant validity. The multi-agent approach approximately triples the predictive validity of a single-LLM baseline (r = 0.095, not significant) in this study, demonstrating that domain decomposition and session-level reasoning play a central role in capturing the multidimensional nature of financial literacy from gameplay
comment: 8 pages, 5 figures, IEEE CoG 2026
Bridging the Post-discharge Gap: A Traceable Multi-agent Framework for Safe and Continuous Care
Post-discharge clinical follow-up is critical for maintaining continuity of care and mitigating long-term health risks. However, traditional follow-up paradigms suffer from shortage of health workforce, fragmented patient histories, and information silos across clinical departments. While large language models have demonstrated potential in medical question-answering, their deployment in continuous care is hindered by hallucination risks and a fundamental inability to reason over longitudinal, patient-specific constraints. Here we present Healink, a memory-enhanced multi-agent framework to support AI-assisted post-discharge follow-up by generating prescription-grounded, traceable responses that improved completeness and perceived clinical utility in retrospective and physician-blinded evaluations. The architecture seamlessly integrates a triage routing mechanism, a unified memory enhancement module utilizing a robust relational database for optimal latency, and a strict constraint-based retrieval-augmented generation engine. By vectorizing historical clinical records and employing weighted similarity functions across diverse phenotypic and intervention dimensions, Healink ensures precise inter-patient and intra-patient case matching while actively preventing cross-departmental drug conflicts. We evaluated Healink on a dataset comprising 400 continuous and 85 highly complex real-world follow-up cases, alongside the webMedQA benchmark. In a rigorous single-blind evaluation conducted by clinical experts, the framework outperformed human physician baselines in both authoritativeness and clinical safety. By generating a traceable, white-box evidence chain, Healink provides a scalable, safe, and highly effective paradigm for intelligent patient management, ultimately enhancing societal healthcare outcomes.
comment: 23 pages, 10 figures
EvoFlock: evolved inverse design of multi-agent motion
This paper describes an automatic method for adjusting or tuning models of multi-agent motion. Simulating the motion of bird flocks, human crowds, vehicle traffic, and other multi-agent systems is a widely used technique. These simulations model the behavior of a single group member (bird, human, or vehicle). The group behaviors (flock, crowd, traffic) emerge from interactions between group members. These models typically have many numerical control parameters. Even if each parameter is intuitive in isolation, their interaction can be complex and nonlinear. It is challenging to determine which parameters to adjust for the desired change in group behavior. Changing one aspect of group behavior often causes other aspects to change, leading to a tedious process of incremental changes. This work takes an inverse design approach. The desired group behavior is measured with a user-defined objective(/fitness/loss) function and optimized with a genetic algorithm. The objective function used here for basic flocking rewards proper spacing with neighbors, flying near a desired speed, and avoiding obstacles. Interestingly, the vivid alignment seen in bird flocks appears to emerge from maintaining proper spacing between flockmates.
comment: To appear in the Proceedings of Artificial Life 2026
SOLAR: AI-Powered Speed-of-Light Performance Analysis
How fast could a deep-learning model run on target hardware, and how far is today's implementation from that limit? These questions are central to software, hardware, and algorithm optimizations. Speed-of-Light (SOL) analysis answers them by computing a workload's theoretical minimum execution time on a given architecture. Yet deriving SOL bounds remains manual, error-prone, and disconnected from rapid model development. To close this gap, we introduce SOLAR, a framework that automatically derives validated SOL bounds from PyTorch and JAX source code. SOLAR leverages both generative and deterministic components in its flow: an LLM frontend translates any source programs into an executable Affine Loop IR, validated by output comparison; a deterministic flow lifts the IR into an einsum graph; and an analytical backend computes unfused, fused, and cache-aware SOL bounds. SOLAR provides comprehensive operator and language coverage, produces validated bounds with zero observed SOL violations, and offers multi-fidelity analysis that tightens bounds and surfaces optimization insights. We evaluate SOLAR across KernelBench, JAX/Flax models, and robotics workloads. These experiments demonstrate four use cases: headroom analysis at multiple fidelity levels, identifying optimization opportunities, cross-platform exploration, and inverse-roofline hardware provisioning.
Instruction Bleed: Cross-Module Interference in Prompt-Composed Agentic Systems ICML 2026
Practitioners of prompt-composed agentic systems report a recurring failure mode: editing one prompt module silently shifts the behavior of others despite no shared variable or executable dependency. We formalize this as compositional behavioral leakage (CBL): interference between modules sharing a context window. CBL is enabled by architectural non-isolation: transformer self-attention provides no formal boundary between concatenated modules. We probe CBL on a deployed job-evaluation agent (Claude Sonnet 4.6, 144 trials) through a reusable three-channel protocol that perturbs non-focal modules along volume, content, and form. Only the content channel produces a detectable paired effect (Cohen's d = 0.63, bootstrap 95% CI excluding zero); no recommendation flipped -- a sub-threshold regime invisible to standard QA but compounding across the thousands of decisions a deployed agent makes. CBL is orthogonal to known agent-failure axes (adversarial injection, cognitive degradation, multi-agent fault propagation, privacy leakage). We contribute an operational definition, a reusable protocol, a falsifiable prediction set, and a system-class characterization, establishing cross-module interference measurement as a requirement for prompt-composed agent evaluation.
comment: 8 pages, 2 tables. Accepted to the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN), Seoul, South Korea
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier, benchmark, or labeled dataset that remains valid as the agent improves. This ignores a central feature of evolution: species adapt as their environments change with them. We aim to bring the same principle to recursive self-improvement, making evaluation part of the improvement loop and opening search to evolving evaluators, adversarial objectives, and dynamic utilities that may surpass static benchmarks. We introduce the Red Queen Godel Machine (RQGM), an evolutionary framework for recursive self-improvement under non-stationary utilities. The RQGM makes this possible through controlled utility evolution: search is organized into epochs with a fixed within-epoch evaluation criterion, while the utility can be updated at epoch boundaries, so self-improvement guarantees hold per epoch as the objective evolves across them. We begin by showing that even on verifiable coding tasks, the RQGM improves test pass rate over the prior SOTA by adding a complementary agent-as-a-judge code-review signal. This signal is cheaper and the RQGM uses 1.35x-1.72x fewer tokens. We then turn to scientific paper writing and reviewing, and Olympiad-level proof writing and grading, where the RQGM improves performance over prior self-improving agents: co-evolved writers reach 1.78x-1.86x higher acceptance rates under a diverse agent-as-a-judge panel, while co-evolved graders reach 9% higher ground-truth accuracy. In paper reviewing, the strongest baseline reviewer over-accepts AI-generated papers at up to 1.91x the human rate. The RQGM corrects this by introducing an adversarial objective that discovers reviewers equally stringent on AI and human work.
comment: 12 pages main text + 21 pages appendix (37 pages total, incl. references); 10 figures (6 main text + 4 appendix); 10 tables (2 main text + 8 appendix). Preliminary preprint; work in progress. Keywords: self-improving agents, learned evaluation, multi-agent systems, auto- mated scientific discovery, controlled utility evolution, co-evolutionary search, autoresearch
Agentic Analysis for Agentic Infrastructure: An LLM-Powered Pipeline for Comparative Governance of DAO and Corporate AI Protocols
As AI agent protocols proliferate, the governance structures shaping their interoperability standards remain empirically underexamined. We introduce an LLM-powered comparative pipeline for large-scale governance discourse analysis, integrating automated annotation, neural topic modeling, and multi-layer network analysis to study socio-technical power structures at scale. We validate it on two contrasting standards for agent interoperability: ERC-8004 (permissionless, on-chain) and Google A2A (corporate-led). Analyzing 4,323 governance participation records, we combine LLM-assisted coding, topic modeling, and multi-layer network analysis to examine how institutional design shapes thematic priorities and community structure. We find that while governance form influences substantive focus, both regimes exhibit comparable levels of participation inequality and community fragmentation. Discourse alignment is denser in the permissionless setting, suggesting that open governance may foster greater thematic convergence despite decentralized participation. These findings illustrate how LLM-assisted methods can advance the empirical study of technology governance, with implications for designing more equitable agentic AI standards. All data and code are openly available.
Engineering Reliable Autonomous Systems: Challenges and Solutions
Engineering reliable autonomous systems is an important and growing topic in computer science. As autonomous systems become more prevalent, easy-to-use techniques for building them reliably are increasingly important. This workshop report captures and expands on the discussions at the Lorentz Center Workshop "Engineering Reliable Autonomous Systems" (ERAS), held from 10 to 14 June 2024. The workshop was co-organised by the organisers of the Workshop on Formal Methods for Autonomous Systems (FMAS) and the Workshop on Agents and Robots for reliable Engineered Autonomy (AREA). It brought together members of the FMAS and AREA communities, industry practitioners, and representatives from sectors where autonomous systems pose distinctive engineering challenges. The workshop focused on three main research topics: techniques for verification and validation of autonomous systems; engineering real-world autonomous systems; and software architectures for safe autonomous systems. Its main outcome is a catalogue of challenges in these areas and, most importantly, a pathway to solutions. Some challenges can already be tackled by techniques that are well known in academia but have not yet become regularly used in practice. Other challenges remain unresolved and require further research. This roadmap is intended to support future research and industrial collaboration.
Introduction to Automated Negotiation
This book is an introductory textbook targeted towards computer science students who are completely new to the topic of automated negotiation. It does not require any prerequisite knowledge, except for elementary mathematics and basic programming skills. This book comes with an simple toy-world negotiation framework implemented in Python that can be used by the readers to implement their own negotiation algorithms and perform experiments with them. This framework is small and simple enough that any reader who does not like to work in Python should be able to re-implement it very quickly in any other programming language of their choice.
Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems
Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.
Specifying AI-SDLC Processes: A Protocol Language for Human-Agent Boundaries
AI agents now participate as first-class team members across the software development lifecycle, yet no specification language exists for expressing the human-agent responsibility boundaries, approval gates, and governance constraints this collaboration requires. Existing approaches encode process in agent prompts (subject to drift), target adjacent domains (workflow management, business processes), or address only fragments (access control, approval gates). We propose a domain-specific language for specifying AI-SDLC processes as protocols, with formal syntax, well-formedness conditions, operational semantics, and enforcement invariants. The language distinguishes policy (declared intent) from mechanism (structural enforcement), enabling implementations to bound process non-determinism through primitives such as validation tokens and capability boundaries. Three results follow. A failure rate analysis shows that structural enforcement bounds system failure rates at a weighted product of agent and validator rates, while behavioral compliance permits cumulative or near-saturating growth. The 2+N team pattern (two human-in-control roles plus N specialized agent members) formalizes classical Separation of Duties for AI-SDLC. Kleene closure of orchestration loops and reflexive protocol-adherence validation emerge as design properties rather than special-case constructs. We position the contribution against multi-agent frameworks (MetaGPT), workflow specification (FlowAgent, BPMN extensions), and capability-based security (SAGA): the novelty lies in the specific integration, not any single primitive. A working implementation demonstrates feasibility; empirical evaluation is future work.
comment: Position paper with formal specification, failure rate analysis, and feasibility demonstration. Companion empirical paper and open-source implementation forthcoming
Chisme: Heterogeneity-Aware Gossip Learning
As end-user device capability increases and demand for intelligent services at the Internet's edge rises, distributed learning has emerged as a key enabling technology for the intelligent edge. Existing approaches like federated learning (FL) and decentralized FL (DFL) enable privacy-preserving distributed learning among clients, while gossip learning (GL) approaches have emerged to address the potential challenges in resource-constrained, connectivity-challenged infrastructure-less environments. However, most distributed learning approaches assume largely homogeneous data distributions and may not consider or exploit the heterogeneity of clients and their underlying data distributions. This paper introduces Chisme, a novel fully decentralized distributed learning algorithm designed to address the challenges of implementing robust intelligence in network edge contexts characterized by heterogeneous data distributions, episodic connectivity, and sparse network infrastructure or lack thereof. Chisme leverages the affinity between clients' underlying data distributions calculated from received model exchanges to inform how much influence received models have when merging into the local model. By doing so, it enables clients to strategically balance between broader collaboration to build more general knowledge and more selective collaboration to build specific knowledge. We evaluate Chisme against contemporary approaches using image recognition and time-series prediction scenarios while considering different network connectivity conditions, representative of real-world distributed intelligent systems running at the network's edge. Our experiments demonstrate that Chisme outperforms state-of-the-art edge intelligence approaches in almost every case -- clients using Chisme exhibit faster training convergence, lower final loss after training, and lower performance disparity between clients.
comment: This work has been submitted to the IEEE Globecom 2026 for review
Systems and Control (EESS)
Deep Reinforcement Learning-Enhanced Event-Triggered Data-Driven Predictive Control for a 3D Cable-Driven Soft Robotic Arm
Soft robots are challenging to control due to their nonlinear and time-varying dynamics. Data-enabled predictive control (DeePC) offers a model-free alternative by directly leveraging measured input-output trajectories to construct a predictive controller. However, its receding-horizon formulation requires solving a constrained optimization problem at every sampling instant, which can be computationally demanding for real-time deployment on resource-limited robotic platforms.To address this limitation, we propose an adaptive reinforcement-learning-based event-triggered DeePC (RL-ET-DeePC) framework for soft robotic control. A model-free RL policy is trained to determine when to invoke the DeePC optimizer based on the current system state representation, thereby reducing unnecessary optimization calls while preserving closed-loop performance.Simulation results show that RL-ET-DeePC reduces optimization frequency by up to 66% compared to periodic DeePC, while maintaining comparable tracking accuracy. Hardware experiments on a three-dimensional cable-driven soft robotic arm demonstrate zero-shot transfer, achieving a 34% reduction in optimization frequency with tracking accuracy comparable to periodic DeePC and more consistent performance than a static threshold-based event-triggered baseline.
Explainable Control Framework (XCF) based on Fuzzy Model-Agnostic Explanation and LLM Agent-Supported Interface
Increasing demand for precise and reliable control in complex scenarios has led to the development of increasingly sophisticated controllers, including data-driven approaches employing closed box models and mathematically rigorous yet complex designs. This complexity highlights the needs for explainable control that can provide human-understandable insights into controller behavior. In this paper, an explainable control framework (XCF) along with supporting algorithms and user interface are proposed to explain how controllers determine their control actions and their underlying working mechanism. The novel contributions of this work are threefold: First, the XCF is designed to provide model-agnostic explanations for controllers in closed-loop systems and can optionally refine local explanations by system response dynamics. Second, a novel explanation method, hierarchical fuzzy model-agnostic explanation for control systems (HFMAE-C), is proposed based on the designed framework. The HFMAE-C employs a fuzzy logic system to approximate the controller's behavior and system dynamics, providing sample, local, domain and universe level explanations via IF-THEN rules revealing the controller's decision logic and salience values quantifying the contribution of system states to control actions. Third, a large language model agent-supported user interface is developed to automatically analyze user requirements, select appropriate algorithms, interpret the generated explanations to a natural language report, and provide interactive consultation. Case studies on inverted pendulum system and Turtlebot obstacle avoidance demonstrate the effectiveness of the proposed method through simulated user experiments and quantitative comparisons with mainstream explainable control approaches.
Robustness and Leadership in Markov-switching Consensus Networks
We investigate how time-varying interactions, modeled via a Markov switching graph (MSG), impact the robustness of noisy multi-agent dynamics in both continuous- and discrete-time settings. Our focus is on the steady-state performance of consensus and leader-follower tracking dynamics subject to stochastic noise. Using the framework of Markov jump linear systems (MJLS), we derive expressions for the steady-state covariance of each agent's deviation from consensus and tracking error, respectively, and use them to quantify individual and group performance as a function of the interaction graphs and the switching dynamics. We extend established notions of robustness, certainty indices, and joint centrality from static graphs to the MSG setting. To gain analytical insight, we specialize our results to systems switching between two topologies and characterize how switching influences performance. Numerical simulations further illustrate how switching topologies affects system robustness in both coordination tasks.
comment: An extended version of an earlier IEEE CDC paper
Power-Budgeted Underwater Vehicle Control via Constrained Reinforcement Learning
Underwater vehicles operate from a fixed onboard energy budget that propulsion rapidly depletes, so a controller that completes its task while drawing less thruster power directly extends mission range and endurance. Reinforcement learning yields capable model-free controllers for station-keeping and trajectory tracking, but optimizing task accuracy alone drives the policy toward oscillatory, energy-wasting actuation. The established remedy subtracts an energy penalty from the reward, yet this sets the task-power trade-off through a single weight with no physical units: a target power level cannot be specified, the weight must be re-tuned for every vehicle and task, and a mismatched weight can even raise power. This paper instead formulates energy-efficient underwater control as a constrained Markov decision process in which average thruster power is subject to an explicit budget, solved with a PPO-Lagrangian algorithm. The power level is set by declaring a budget in physical units, and a single dual variable is updated online to meet it for each vehicle and task, without manual weight search. Across three vehicles and four tasks in the MarineGym simulator, the energy-constrained policy draws the least power in all twelve settings, reducing it by 14--65\% (up to 64.9\%) over a task-only baseline and below an energy-reward baseline everywhere, while remaining the smoothest in ten settings and preserving task accuracy except in one deliberately power-limited regime. Imposing energy as an explicit constraint thus offers a tuning-free route to energy-efficient underwater control that needs no per-vehicle, per-task weight search.
comment: 10 pages, 10 figures
Reference-Free Heterogeneous Multi-Agent Reinforcement Learning for Grid-Friendly Tie-Line Power Shaping in Industrial Microgrids
Tie-line power (TLP) shaping is a key requirement for the grid-friendly operation of industrial microgrids (IMGs). This paper studies the coordination of multi-timescale heterogeneous adjustable resources in a steel IMG to shape a grid-friendly TLP trajectory considering multiple objectives. A sequential heterogeneous-agent coordination (SHAC) framework is proposed, where process loads, hydrogen storage, and battery storage are modeled as functionally heterogeneous agents with cross-role observations, asynchronous decision intervals, role-specific rewards and critics. This design captures the heterogeneous temporal effects of different resources on the TLP trajectory and alleviates ambiguous credit assignment and weak inter-agent coordination. To ensure feasible real-time execution, process-knowledge-based action masking and feasibility projection are embedded into policy execution, and a role-aware multi-timescale actor--critic training scheme is developed for agents with different action structures and decision intervals. Numerical studies using real renewable generation and electricity market data show that SHAC effectively eliminates the dependence on predefined reference trajectories and enables adaptive 1-min online decision-making, achieving zero production failures with an average computational time of only 0.4 ms per step. Compared with the original operation, SHAC reduces the total grid purchase cost, contract-demand exceedance time, and cumulative ramp excess by 91.27\%, 98.64\%, and 96.91\%, respectively. These results demonstrate that the proposed framework improves the economic efficiency and grid friendliness of industrial microgrid operation while satisfying strict process-safety constraints and real-time computational requirements.
Control Barrier Function only Formation Tracking in Multi-Agent Systems
This paper presents a real-time control framework for formation tracking of heterogeneous multi-agent systems with non-linear dynamics. The proposed method formulates a single Control Barrier Function-like constraint within a quadratic optimization setting that addresses formation tracking. Relying on the relative information of neighboring agents, the controller is designed to operate without the need for manual parameter tuning or a separate nominal formation controller. The leader-follower framework is validated through simulations of moving formations.
comment: 9 pages, 2 figures
Conformal Recovery-Deadline Certificates for Runtime Assurance of Adapting Controllers
Runtime assurance (RTA) protects a safety-critical system by switching from an advanced controller to a verified safe controller when a monitored condition is violated. The standard latching rule, which trips on the first breach of the safe set and then coasts, is correct for a diverging controller but pathological for a capable online-adapting one. Such a controller is unsafe by design during a bounded recovery transient. It must excite the plant to identify the fault before it can correct it, so a latching shield trips on that transient and suppresses a controller that would have recovered. We introduce the conformal recovery-deadline certificate, a split-conformal, distribution-free, finite-sample upper bound on the adapting controller's recovery time that licenses delayed fallback with a coverage guarantee, backstopped by a verified monitor at a hard critical limit. The certified deadline discriminates capable from incapable controllers, keeping the recoverer autonomous while catching the diverger. The construction separates autonomy, governed by statistical coverage, from safety, governed by the verified backstop, as an instance of reliability-asymmetric design. We prove marginal coverage, a weighted extension that restores coverage under a known fault-distribution shift, and group-conditional Mondrian coverage. We demonstrate all three on two unrelated Simplex testbeds: a 6-DOF spacecraft attitude controller and a torque-controlled inverted pendulum. Both show the same suppression pathology and the same cure, making the certificate a domain-general mechanism rather than a single-system trick.
Deterministic Non-Smooth Safety via Dual-Algebraic Control Barrier Functions
This paper presents a dual-algebraic framework for control barrier functions (CBFs) that guarantees deterministic execution using exclusively elementary arithmetic. We develop this deterministic approach to solve a fundamental bottleneck in safety-critical control: pointwise minima naturally compose intersecting safe sets, but generate non-smooth boundaries where standard Lie derivatives fail. Existing mathematical workarounds inject approximation bias, probabilistic non-determinism, or combinatorial execution delays that strictly impede hard real-time hardware certification. By embedding the system state and vector field into the dual-number ring, our method extracts both the composite barrier value and its exact directional derivative in a single evaluation. The standard floating-point minimum deterministically isolates a single vertex of the Clarke generalized gradient for the quadratic-program solver. We prove this selected vertex constitutes a valid Clarke subgradient and the resulting simultaneous-enforcement safety filter guarantees forward invariance. The arithmetic overhead remains a fixed constant factor, strictly independent of state dimension or constraint count. We extend this framework to arbitrary $\min$/$\max$ Boolean compositions and systems of higher relative degree, validating the computational scaling on three physical examples.
Active Learning for Optimal Experimental Design in Machine Learning-Based Building Energy System Identification
Machine learning (ML) techniques have been commonly adopted to identify the dynamics of building energy systems (BESs), owing to their flexibility relative to first-principles, physics-based modeling approaches. Beyond the choice of ML architecture, the quality of the training data plays an essential role in the resulting model performance. Optimal experimental design (OED), realized in this work through active learning (AL), determines which experiments to conduct in order to collect informative data, rather than relying on standard approaches such as uniformly random sampling. This paper proposes a systematic comparison of OED via AL for building energy system identification, with a particular focus on HVAC thermal dynamics. We investigate fourteen AL techniques across two ML model classes, namely a deterministic feedforward neural network and a stochastic Gaussian process, and classify these techniques into four categories: data space, uncertainty, information gain, and model change. To examine the AL algorithms under realistic conditions, we implement and evaluate them on the high-fidelity building simulator BOPTEST. The results, reported as the root mean square error across multiple test scenarios with varying initial dataset sizes and control input constraints, show that AL-based models generally outperform models trained via passive learning (PL) with uniformly random control inputs, achieving error reductions of up to 54\%, although the magnitude and consistency of this improvement vary across acquisition functions and operating regimes.
GaN Power Devices and Converter Architectures for AI Data Centers: Efficiency, Reliability, and Deployment Pathways
The growth of artificial-intelligence workloads is increasing the electrical and thermal demands on data-center power-delivery systems, making conversion efficiency, power density, and reliability critical design priorities. This review examines how gallium-nitride (GaN) power devices can be matched to specific stages of the grid-to-load conversion chain, including power-factor correction, isolated DC/DC conversion, 48-V intermediate-bus conversion, and point-of-load regulation. Si, SiC, and GaN are compared using converter-relevant metrics, and lateral, vertical, and specialized GaN architectures are evaluated in terms of voltage scalability, switching behavior, reverse conduction, thermal pathways, gate control, and technology maturity. The analysis shows that GaN provides a stage-dependent rather than universal advantage. Commercial lateral GaN HEMTs are particularly effective in high-frequency, low-to-mid-voltage stages, while specialized and hybrid devices support bidirectional operation, normally-off control, extreme conversion ratios, and integration. Vertical GaN remains an emerging option for higher-voltage and higher-power conversion. A quantitative framework links cascaded converter efficiency to electrical-loss reduction, cooling demand, annual facility energy use, and operational carbon emissions. Broad deployment further requires low-parasitic packaging, disciplined gate-drive and EMI co-design, mission-profile reliability qualification, scalable manufacturing, and supply-chain resilience. GaN is therefore best treated as a stage-specific system lever whose value depends on coordinated device, topology, package, and thermal co-design.
Input Convex Neural Network as a Surrogate in Stability-Constrained Optimization for IBR-dominated Power Systems
Input convex neural networks (ICNNs) are increasingly used as surrogates for stability indices and embedded as constraints in power-system optimization. This letter clarifies two recurring formulation limitations that can negate ICNN convexity benefits: (i) applying generic Big-$M$ mixed-integer reformulations introduces auxiliary binaries that are unnecessary for enforcing ICNN sublevel constraints; and (ii) reversing the stability inequality transforms a convex sublevel set into a generally nonconvex superlevel set, invalidating global-convergence guarantees of cut-based methods. After clarifying the limitations, we provide (i) an exact LP-based epigraph reformulation for ReLU-ICNNs, (ii) an outer-approximation scheme with global guarantees under the sublevel convention, and (iii) a feasibility-preserving inner-approximation scheme for the superlevel convention, with simulations on IEEE 14- and 118-bus unit commitment instances.
PRISM: Efficient and Locally Optimal Probabilistic Planning with Reachability Guarantees
Belief-space planning under motion uncertainty and state and control constraints remains a fundamental challenge, largely due to the difficulty of establishing reachability guarantees in constrained belief spaces. Existing constrained belief-space planners rely on sampling to construct multi-query belief roadmaps and explicitly find feasible trajectories between sampled nodes to establish reachability. These methods often struggle to cover the belief space or use robust control techniques that improve coverage at the cost of indirect, high-cost trajectories; they also lack finite-time or finite-memory completeness guarantees. We propose PRISM, a multi-query motion planning algorithm for belief spaces with state and control constraints that targets both high coverage and low cost. We present a new result on controllability of the state covariance under constraints, which is used by PRISM to decompose belief-space planning into deterministic mean planning and covariance shrinking. PRISM further includes an online local optimization method that reduces the cost of feasible belief-space trajectories. Under mild assumptions on the start and goal distributions, we prove that PRISM guarantees full coverage (i.e. completeness) despite actuator and obstacle constraints. In challenging simulated scenarios, PRISM achieves substantially higher roadmap coverage than state-of-the-art belief-space planning methods while producing trajectories with lower mean cost and cost variance. For example, PRISM achieves 100% coverage in easy and medium-difficulty scenarios, and, in the hardest scenario, which violates PRISM's coverage assumptions, it still achieves 97-100% coverage, while all other methods achieve less than 45%.
When Agents Meet Electric Bus Fleet Operations: Pricing Behavior, Trade-offs, and Policy Implications in an Aggregator Framework
Agentic systems are changing how complex operational tasks are coordinated, introducing a new paradigm for connecting heterogeneous data sources and automating processes. Electric bus fleets provide a relevant test case. Their operation requires continuous coordination between service reliability, battery state-of-charge, charger availability, electricity prices, route-energy uncertainty, and vehicle-to-grid (V2G) opportunities. This paper proposes an agentic aggregator framework that streamlines this decision environment by coupling an optimization-based electric bus scheduling model with supervisory agents for disturbance detection, tariff adaptation, and schedule evaluation. The optimization core enforces physical feasibility across routes, chargers, batteries, and V2G exchanges, while the agentic layer interprets changing operating conditions, triggers real-time re-optimization when needed, and defines how flexibility value is allocated between the aggregator and the public transport operator (PTO). A realistic depot case study evaluates day-ahead and real-time operations under profit-based and operation-based coordination modes, considering service delays, route-energy deviations, electricity price shocks, and combined disturbances. The results show that agentic aggregation can support adaptive fleet-grid coordination by maintaining feasible schedules, activating re-optimization selectively, and improving the use of charging and V2G flexibility. However, they also reveal a critical trade-off: the same agentic capability that reduces operational complexity can extract value from the PTO when configured around profit-oriented pricing. These findings suggest that agentic aggregators can become useful for managing electric bus V2G operations, but their deployment in public-fleet contexts requires transparent coordination modes, auditable tariff-setting, and explicit value-sharing rules.
pysib: An Open-Source Python Toolbox for Linear System Identification
Discrete-time polynomial input--output models (ARX, ARMAX, OE, and Box--Jenkins) are usually estimated by prediction-error methods, but for OE, ARMAX, and BJ the finite-sample criterion is nonconvex: the estimate a user actually obtains is set by the initialization and the optimization procedure, not only by the asymptotic theory. This article documents the dedicated optimization strategy behind pysib, an open-source Python toolbox for SISO polynomial system identification. The strategy consists of an ARX-based initialization, a smoothed-gradient phase, an incremental Gauss--Newton refinement, and filtered continuation interpreted as cost-function shaping. This strategy produced the filtered-continuation results reported by the author in earlier work but had not previously been described or released; it is given here in full and as open Python software, with a common five-polynomial representation, shared prediction and simulation routines, and the scripts and archived release needed to reproduce the experiments. On a moderate-noise OE benchmark the strategy returns estimates far more concentrated around the true parameters than a general-purpose nonlinear-programming solver, and on a harder nonconvex benchmark filtered continuation raises the success rate from 60% to 100%.
comment: 8 pages, 2 figures, 6 tables
Bayesian Changepoint Detection for Smart Sensing of Battery Degradation: Cycle-Level Health Indicators and PyMC Implementation
Reliable detection of the onset of accelerated degradation is central to safe and cost-efficient operation of lithium-ion batteries. This paper presents a Bayesian single-changepoint model applied to a simple but physically meaningful cycle-level health indicator (HI), defined as the ratio of charge time to discharge time. The indicator is computed directly from voltage-current telemetry typically available in battery management systems (BMS), without access to raw waveforms. The changepoint model is implemented in PyMC using Hamiltonian Monte Carlo and produces posterior distributions for onset time and pre/post-degradation slopes, together with posterior predictive checks. Experiments on an open 18650-cell remaining useful life (RUL) dataset show consistent midlife changepoints with narrow highest-density intervals. The formulation is lightweight, interpretable, and amenable to smart-sensing deployment on embedded BMS platforms.
comment: This work has been submitted to the IEEE for possible publication
Scalable Reachability Analysis of Linear Continuous Systems with Property-Driven Time-Step Adaptation
We study safety verification for linear time-invariant systems with bounded inputs in continuous time. The standard approach reduces to a reachability analysis in two steps: first discretize time and then apply a forward analysis in the discretized system. Existing algorithms use either a fixed time step or an adaptive time step that changes based on the approximation error compared to the underlying continuous system. In this paper, we present an efficient reachability algorithm that adapts the time step based on a given safety property. Essentially, our algorithm makes the largest possible time step such that it can still prove safety. For this approach to be scalable in practice, we discuss several optimizations such as avoiding the repeated expensive calculation of the matrix exponential during discretization and a careful balance how we tame the approximation error stemming from the states and the inputs. This allows our algorithm to yield a moderate approximation error even when using a large time step, thus requiring much fewer steps than prior algorithms. We demonstrate the effectiveness and scalability on the large-scale SLICOT benchmark suite, where our algorithm consistently outperforms other state-of-the-art approaches.
comment: QEST+FORMATS 2026
Health feature extraction from battery energy storage system field fault data
Health monitoring methods are critical for lithium-ion battery modules connected to the grid to prevent faults that can lead to catastrophic events. However, assessing the health of cells in modules from their operational data presents challenges including variable operating conditions, which directly confound health features, and sparse sensing in the modules, particularly within cells in parallel, which prevents observing critical states of individual cells. Here, we present a framework for extracting and calibrating health features for battery modules from their operational data to identify discriminative features for separating faulty parallel-connected cell groups within the modules. We applied this framework to operational data from 25 commercial grid-connected lithium-ion Battery Energy Storage System (BESS) modules. Each module consisted of 14 series-connected parallel groups, one of which was confirmed as faulty via post-mortem investigation; in total, the dataset included 25 faulty and 325 non-faulty cell groups. A statistical evaluation of these calibrated features demonstrated that group-level capacity, capacity degradation rate, and dV/dQ peak heights separate faulty parallel-connected cell groups within the modules with statistical significance (p<0.05). Conversely, group internal resistance did not (p>0.05), indicating that increased resistance was not a primary characteristic of the faults in this dataset. These findings challenge the exclusive reliance on resistance features for fault detection. The observed feature signatures suggest potential failure mechanisms, furthering the understanding of fault behavior in lithium-ion battery modules during field operation. More importantly, this work demonstrates a framework for robustly monitoring the health of cells in lithium-ion battery modules under real-world operations.
Feasibility-Aware Security-Constrained Unit Commitment via Hybrid Soft Actor-Critic with Quantum-Sampled Features
Security-constrained unit commitment (SCUC) couples binary commitment, economic dispatch, reserves, and network security over a multiperiod horizon, which makes an exact solution expensive at realistic system sizes. This paper proposes a three-layer hybrid framework in which a Bernoulli hybrid soft actor-critic (HSAC) policy proposes hourly commitments, a quantum-sampled auxiliary channel augments the state, and a native SCUC mixed-integer linear program recovers dispatch and security variables after only a limited subset of commitment binaries is enforced. The method is therefore solver-compatible rather than an end-to-end replacement for exact optimization. We formalize the SCUC-to-reinforcement-learning interface, derive the temporal coverage induced by the fixed cap, and conduct representative experiments on the 14-, 57-, and 118-bus cases. The results show stable, low-cost recovery in the 14-bus case; a very low screen-rejection rate in the 57-bus case, consistent with learned feasibility generalization under fixed intertemporal SCUC constraints; and a clear coverage bottleneck in the 118-bus case once the enforcement cap no longer spans a complete commitment period. The 118-bus case runtime traces nevertheless remain tightly clustered for accepted episodes, indicating that the policy still captures a repeatable recovery pattern across most episodes. The study, therefore, identifies the dominant limitation of the current implementation as the amount of useful commitment information that reaches the recovery model under an exploratory Bernoulli actor and a small enforcement cap, and shows how that limitation governs scalability.
comment: To appear in IEEE SmartGridComm 2026
A Bilevel Framework for Data Center-Grid Coordination with DLMPs in Unbalanced Three-Phase Distribution Systems
This paper proposes a grid-aware coordination framework between data centers and distribution grids using a DLMP-based bilevel optimization model. The data center aggregator (DCA) determines active power demand in response to distribution locational marginal prices (DLMPs), while the distribution system operator (DSO) solves a network-constrained optimal power flow problem to determine DLMPs in an unbalanced three-phase system. The model incorporates both active and reactive power consumption of data centers to evaluate their impacts on voltage regulation and phase imbalance. To mitigate adverse network effects, two operating cases are analyzed: without reactive power compensation and with static var generator (SVG)-based compensation. The proposed approach is validated on the IEEE 37-bus unbalanced distribution test system. Simulation results show that DLMP-based coordination captures economically efficient data center operation, and phase- and location-dependent network conditions, while SVG-based compensation improves voltage profiles and reduces phase unbalance.
Racing a Wheeled Quadruped: Active Load Transfer Mitigation via Model Predictive Control
This paper presents a hierarchical control framework using model predictive control (MPC) and reinforcement learning (RL) for active roll control to manage lateral load transfer during autonomous racing of a wheeled quadruped. The framework integrates offline time-optimal raceline generation, an online MPC planner that actively minimizes the lateral Load Transfer Ratio (LTR), and a low-level, whole-body RL policy deployed directly onto the robot's 16 actuators. The MPC is based on a vehicle dynamics bicycle model of the Unitree Go2-W platform. The robot's leg actuators act as active suspension where knee joints generate anti-roll torque to bank into turns. Physical track experiments demonstrate that active roll control reduces mean LTR by up to 44%, improves the fastest lap time by 8.7%, and boosts peak lateral acceleration capability by 21.3% to 1.98 $m/s^2$, maintaining robust high-speed stability beyond the range of a non-tilting baseline controller. Supplementary code and video can be found at https://github.com/meisman-ucb/go2w-roll-control-mpc
comment: Accepted to the 17th International Symposium on Advanced Vehicle Control (AVEC 2026), 7-11 September 2026, Tsukuba, Japan
Fiber Bragg grating-based acoustic sensing system enabled by ML-trained, sub-picometer-tunable hybrid III-V/SiN lasers
Distributed acoustic emission (AE) sensing is critical for early detection of structural degradation, yet conventional electrical sensors are difficult to scale and fiber-based approaches are limited by interrogation complexity and resolution. Here, we report an intelligent fiber Bragg grating (FBG) sensing system enabled by machine learning (ML)-trained hybrid III-V/SiN tunable lasers that achieve uniform, mode-hop-free, sub-picometer wavelength control. A supervised gradient-descent algorithm is used to learn the nonlinear electro-thermal tuning space of Vernier-based external-cavity lasers, enabling continuous tuning with <0.1 pm resolution and <0.5 dB power variation. This capability allows precise alignment to FBG reflection slopes for high-sensitivity acoustic detection. We demonstrate a four-laser interrogation system monitoring 16 FBG sensors distributed across multiple metallic structures, operating over a 35 nm wavelength span. The system autonomously identifies sensor resonances, dynamically tracks spectral shifts, and reconfigures interrogation wavelengths in response to localized acoustic events. Using pencil-lead break tests as calibrated AE sources, we show simultaneous multi-channel detection and adaptive spatial localization. The combination of narrow linewidth (<10 kHz), wide tunability, and ML-driven calibration enables robust, scalable, and high-resolution sensing. This approach establishes a pathway toward fully autonomous, distributed photonic sensing networks for real-time structural health monitoring.
AIChilles: Automatically Uncovering Hidden Weaknesses in AI-Evolved Systems
The computer systems community has recently seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks such as AdaEvolve and Engram report 12-60% score improvements over human-designed algorithms. While these results are promising, there are practical concerns if these AI-evolved programs can perform worse on unseen workloads and exhibit scalability regressions. Given the speed and scale of AI-generated code, we need automated mechanisms to uncover such identify hidden weaknesses in AI-evolved systems programs. To this end, we develop AIChilles that takes as input a baseline program $P$ and an AI-evolved program $P'$, AIChilles searches for valid workloads where $P'$ regresses relative to $P$ in correctness, runtime, memory usage, or output quality. To tackle the diversity in system applications, weakness types and potential bugs, AIChilles combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures. Across five system applications and 30 AI-evolved programs, AIChilles finds 49 distinct hidden weaknesses. We also show that explicitly including AIChilles in the AI-driven development lifecycle can mitigate several of these weaknesses.
Optimal Sampling and Actuation for Real-Time Monitoring of Markov Sources
This paper studies efficient data management and timely information dissemination for real-time monitoring of an N-state Markov process, with the objective of enabling accurate state estimation and reliable actuation decisions. We analyze the real-time reconstruction error and the Age of Incorrect Information (AoII), and derive closed-form expressions for their time-averaged values under several sampling and transmission policies. We then formulate and solve constrained optimization problems to minimize the time-averaged reconstruction error and the average AoII under a time-averaged sampling frequency constraint. The resulting optimal sampling and transmission policies are compared to identify the conditions under which each policy is most effective. We further show that directly using the reconstructed state for actuation can degrade system performance, especially when the receiver is uncertain about the state estimate or when actuation is costly. These findings reveal that accurate state estimation alone does not necessarily lead to effective actuation, highlighting the importance of incorporating uncertainty into the decision-making process. To address this issue, we introduce a cost function, termed the Cost of Actions under Uncertainty (CoAU), which characterizes correct and incorrect actuation decisions under receiver-side uncertainty. We propose a randomized actuation policy and derive a closed-form expression for the probability of a correct actuation decision, defined as the event in which the CoAU equals zero. Finally, we formulate an optimization problem to find the optimal randomized actuation policy that maximizes this probability. The results show that the resulting policy substantially reduces incorrect actuator actions.
Accelerated Decentralized Constraint-Coupled Optimization: A Dual$^2$ Approach
In this paper, we focus on a class of decentralized constraint-coupled optimization problem: $\min_{x_i \in \mathbb{R}^{d_i}, i \in \mathcal{I}; y \in \mathbb{R}^p}$ $\sum_{i=1}^n\left(f_i(x_i) + g_i(x_i)\right) + h(y) \ \text{s.t.} \ \sum_{i=1}^{n}A_ix_i = y$, over an undirected and connected network of $n$ agents. Here, $f_i$, $g_i$, and $A_i$ represent private information of agent $i \in \mathcal{I} = \{1, \cdots, n\}$, while $h$ is public for all agents. Building on a novel dual$^2$ approach, we develop two accelerated algorithms to solve this problem: the inexact Dual$^2$ Accelerated (iD2A) gradient method and the Multi-consensus inexact Dual$^2$ Accelerated (MiD2A) gradient method. We demonstrate that both iD2A and MiD2A can guarantee asymptotic convergence under a milder condition on $h$ compared to existing algorithms. Furthermore, under additional assumptions, we establish linear convergence rates and derive significantly lower communication and computational complexity bounds than those of existing algorithms. Several numerical experiments validate our theoretical analysis and demonstrate the practical superiority of the proposed algorithms.
A Data-Driven Approach To Preserve Safety and Reference Tracking for Constrained Cyber-Physical Systems Under Network Attacks
This paper proposes a worst-case data-driven control architecture capable of ensuring the safety of constrained Cyber-Physical Systems under cyber-attacks while minimizing, whenever possible, potential degradation in tracking performance. To this end, a data-driven robust anomaly detector is designed to detect cyber-attack occurrences. Moreover, an add-on tracking supervisor module allows safe open-loop tracking control operations in case of unreliable measurements. On the plant side, a safety verification module and a local emergency controller are designed to manage severe attack scenarios that cannot be handled on the controller's side. These two modules resort to worst-case reachability and controllability data-driven arguments to detect potential unsafe scenarios and replace, whenever strictly needed, the tracking controller with emergency actions whose objective is to steer the plant's state trajectory in a predefined set of admissible and safe robust control invariant region until an attack-free scenario is restored. The effectiveness of the proposed solution has been shown through a simulation example.
comment: Preprint of a journal manuscript submitted to the European Journal of Control
Gray-Box Nonlinear Feedback Optimization
Feedback optimization enables autonomous optimality seeking of a dynamical system through its closed-loop interconnection with iterative optimization algorithms. Among various iteration structures, model-based approaches require the input-output sensitivity matrix of the system to construct gradients, whereas model-free approaches eliminate this need by estimating gradients from real-time objective evaluations. These approaches offer complementary benefits in sample efficiency and accuracy against model mismatch, i.e., sensitivity errors. To achieve balanced closed-loop performance, we propose a gray-box feedback optimization controller, featuring systematic incorporation of approximate sensitivities into model-free updates via a tunable convex combination. We provide unified performance characterizations covering different approaches. We elucidate how cumulative sensitivity errors (model-based) and variances due to stochastic exploration (model-free) shape the closed-loop behavior and induce a trade-off between iteration and dimensional dependence. The proposed controller retains sample efficiency and provable (local) optimality for nonconvex problems despite inaccurate sensitivities. We further develop and characterize a running gray-box controller that handles constrained time-varying problems with changing objectives and steady-state input-output maps.
comment: published in IEEE Transactions on Automatic Control, 2026
Stable but Unsafe: Agent-Driven Cyber-Physical Systems Under Gain Manipulation Attacks
AI agents are increasingly being connected to Cyber-Physical Systems (CPS) to generate or modify control-relevant parameters at runtime, including feedback gains, cost weights, and reference signals. These updates create a parameter channel: a pathway between the agent and the controller that is structurally distinct from classical sensor and actuator channels. Among the parameters carried by this channel, feedback gains are especially high-leverage: under linear state feedback, a single gain matrix determines closed-loop eigenvalue placement for the entire system. Consequently, malicious gain updates can reshape the closed-loop dynamics without producing the signal-level inconsistencies targeted by residual-based monitors. We formalize this attack surface through a three-axis attacker model and a taxonomy of Gain Manipulation Attacks (GMA). Two impact classes are identified: stability-margin erosion under sustained gain drift and transient amplification under one-shot gain replacement. We demonstrate that an attacker can drive the system past its safe physical operating limits while maintaining mathematical stability, proving that stability verification alone is insufficient to bound the physical impact. Using Bauer--Fike eigenvalue bounds and the Kreiss matrix theorem, we derive exact stealthiness conditions and worst-case impact certificates for each class. Finally, we propose preliminary detection directions and validate our framework through a vehicle lateral dynamics case study.
comment: 6 pages, 2 figures, 2 tables. Submitted to IEEE L-CSS for possible publication
Certified Robust Invariant Polytope Training in Neural Controlled ODEs
We propose a framework for training neural network controllers with certified robust forward invariant polytopes. First, we parameterize a family of lifted control systems in a higher dimensional space, where the original neural controlled system evolves on an invariant subspace of each lifted system. We use interval analysis and neural network verifiers to further construct a family of lifted embedding systems, carefully capturing the knowledge of this invariant subspace. If the vector field of any lifted embedding system satisfies a sign constraint at a single point, then a certain convex polytope of the original system is robustly forward invariant. Treating the neural network controller and the lifted system parameters as variables, we propose an algorithm to train controllers with certified forward invariant polytopes in the closed-loop control system. Through two examples, we demonstrate how the simplicity of the sign constraint allows our approach to scale with system dimension to over $50$ states, and outperform state-of-the-art Lyapunov-based sampling approaches in runtime.
Learning Certified Neural Network Controllers Using Contraction and Interval Analysis
We present a novel framework that jointly trains a neural network controller and a neural Riemannian metric with rigorous closed-loop contraction guarantees using formal bound propagation. Directly bounding the symmetric Riemannian contraction linear matrix inequality causes unnecessary overconservativeness due to poor dependency management. Instead, we analyze an asymmetric matrix function $G$, where $2^n$ GPU-parallelized corner checks of its interval hull verify that an entire interval subset $X$ is a contraction region in a single shot. This eliminates the sample complexity problems encountered with previous Lipschitz-based guarantees. Additionally, for control-affine systems under a Killing field assumption, our method produces an explicit tracking controller capable of exponentially stabilizing any dynamically feasible trajectory using just two forward inferences of the learned policy. Using JAX and \immrax{} for linear bound propagation, we apply this approach to a full 10-state quadrotor model. In $<$10 minutes of post-JIT training, we simultaneously learn a control policy $π$, a neural contraction metric $Θ$, and a verified 10-dimensional contraction region $X$.
Safe Learning Control with Optimality and Stability Guarantees
Merely pursuing performance may adversely affect safety, while a conservative policy for safe exploration will degrade the performance. How to guarantee both safety and performance in learning-based control problems is an interesting yet challenging issue. This paper aims to enhance system performance with a safety guarantee by solving reinforcement learning (RL)-based optimal control problems for nonlinear systems subject to high-relative-degree state constraints and unknown time-varying disturbance/actuator faults. A new type of control barrier functions (CBFs), termed high-order reciprocal-based control barrier function, is proposed to handle high-relative-degree constraints, which extends the design of CBFs to enforce robust safety without knowing the disturbance bound. The concept of gradient similarity is proposed to quantify the relationship between safety and performance. Finally, gradient manipulation and adaptive mechanisms are introduced in the model-based safe RL framework to enhance the performance with a safety guarantee. Two simulation examples illustrate the efficacy of the proposed algorithms.
comment: 16 pages, 6 figures
Control of neural field equations with step-function inputs
Wilson-Cowan and Amari-type models capture nonlinear neural population dynamics, providing a fundamental framework for modeling how sensory and other exogenous inputs shape activity in neural tissue. We study the controllability properties of Amari-type neural fields subject to piecewise/constant-in-time inputs. The model describes the time evolution of the polarization of neural tissue within a spatial continuum, with synaptic interactions represented by a convolution kernel. We study the synthesis of piecewise/constant-in-time inputs to achieve two-point boundary-type control objectives, namely, steering neural activity from an initial state to a prescribed target state. This approach is particularly relevant for predicting the emergence of paradoxical neural representations, such as discordant visual illusions that occur in response to overt sensory stimuli. We first present a control synthesis based on the Banach fixed-point theorem, which yields an iterative construction of a constant-in-time input under minimal regularity assumptions on the kernel and transfer function; however, it exhibits practical limitations, even in the linear case. To overcome these challenges, we then develop a generic synthesis framework based on the flow of neural dynamics drift, enabling explicit piecewise constant and constant-in-time inputs. Extensive numerical results in one and two spatial dimensions confirm the effectiveness of the proposed syntheses and demonstrate their superior performance compared to inputs derived from naive linearization at the initial or target states when these states are not equilibria of the drift dynamics. By providing a mathematically rigorous framework for controlling Amari-type neural fields, this work advances our understanding of nonlinear neural population control with potential applications in computational neuroscience, psychophysics, and neurostimulation.
comment: The last version of the article was published in the SIAM Journal on Control and Optimization
Scalable Design of Attack-Resilient Controllers for Positive Systems
This paper proposes a framework for secure and resilient controller design for positive systems against cyber-attacks. In particular, we consider a network-controlled system where an adversary injects false data into the actuator channels to increase the control cost (performance measure) while penalizing the attack effort and subject to state-dependent constraints. Using a minimax formulation, we analyze the worst-case performance loss caused by such adversaries, which is given by the solution of a difference equation, and an algebraic equation when the time horizon is infinite. We show that the optimal attack policy, among possible nonlinear policies, is linear. Despite the lack of explicit stealthiness constraints, we also show that when the measured output has an unstable zero which is not an unstable zero of the performance measure, the attacks can induce unbounded performance degradation. The proposed framework is also extended to systems with model uncertainty. Numerical examples illustrate the results and demonstrate how tools from positive systems and linear regulator theory can be used to mitigate cyber-attacks with low computational effort.
comment: Accepted for publication in L-CSS and under review for presentation in CDC 2026
Asymmetry-Aware Routing for Industrial Multimodal Monitoring: A Diagnostic Framework
Multimodal fusion is the default approach for combining heterogeneous sensor streams in industrial monitoring, yet no systematic method exists for determining \textit{when fusion degrades rather than improves} detection performance. We present an \textbf{Asymmetry-Aware Routing Framework} -- a three-step diagnostic procedure (unimodal performance gap, gate weight attribution, modality corruption testing) with formal decision criteria -- that routes multimodal systems toward the appropriate fusion strategy before deployment. We validate the framework on three datasets spanning two routing outcomes: (1)~the OHT/AGV industrial dataset (thermal + sensors, 13{,}121 samples), where the framework correctly identifies severe asymmetry (gap ratio 3.1$\times$) and recommends \textsc{cascade}; (2)~a chain conveyor fault detection scenario (audio + vibration), where moderate asymmetry leads to a \textsc{fuse} recommendation with positive fusion benefit; and (3)~the CWRU bearing dataset, providing controlled validation in both directions. Threshold sensitivity analysis across all three datasets shows that the framework's recommendations are robust to threshold perturbation, with correct routing maintained over a wide parameter plateau. Comparison against simpler diagnostics (gap ratio alone) reveals that Step~1 alone is ambiguous for moderate-asymmetry cases, demonstrating the necessity of the full protocol for reliable routing decisions.
comment: A subsequent extension of this analysis to a larger corpus found that the heterogeneity predictor is collinear with the industrial-versus-general domain split, so the corresponding effect is not identifiable on the available data
StochasticBarrier.jl: A Toolbox for Stochastic Barrier Function Synthesis
We present StochasticBarrier.jl, an open-source Julia-based toolbox for generating Stochastic Barrier Functions (SBFs) for safety verification of discrete-time stochastic systems with additive Gaussian noise. StochasticBarrier.jl certifies linear, polynomial, and piecewise affine (PWA) systems. The latter enables verification for a wide range of system dynamics, including general nonlinear types. The toolbox implements a Sum-of-Squares (SOS) optimization approach, as well as methods based on piecewise constant (PWC) functions. For SOS-based SBFs, StochasticBarrier.jl leverages semi-definite programming solvers, while for PWC SBFs, it offers three engines: two using linear programming (LP) and one based on gradient descent (GD). Benchmarking StochasticBarrier.jl against the state-of-the-art shows that the tool outperforms existing tools in computation time, safety probability bounds, and scalability across over 30 case studies. Compared to its closest competitor, StochasticBarrier.jl is up to four orders of magnitude faster, achieves significant safety probability improvements, and supports higher-dimensional systems.
A Cycle-Based Solvability Condition for Real Power Flow Equations
Certifying power flow solvability is important for reliable power system operations under volatile operating conditions, but solving power flow equations repeatedly can be costly and may encounter convergence issues. In this paper, we develop an explicit cycle-based solvability condition for the lossless real power flow equations on meshed networks. We decompose every feasible nodal balance solution into a particular flow plus a cycle flow correction vector. The power flow problem is then reduced to enforcing edge-wise feasibility and cycle consistency. We show that the cycle consistency function is strongly monotone and is the gradient of a strongly convex energy function. By exploiting these properties, we derive an explicit condition for the existence and uniqueness of a power flow solution with bounded angle difference. The resulting condition is invariant under the choice of cycle basis and can be verified through simple algebraic computations. Numerical results on standard test systems show that the proposed condition is significantly less conservative than existing sufficient conditions and closely approximates true loading limits.
comment: To appear in IEEE Control Systems Letters
Robotics
InSight: Self-Guided Skill Acquisition via Steerable VLAs
Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., "move gripper to the bowl", "lift upward", "pour the bottle"). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings demonstrate that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies. Project website: https://insight-vla.github.io.
comment: Project website: https://insight-vla.github.io
Vision-Language Model Reasoning for Contextual Semantic Mapping in Intralogistics
Autonomous mobile robots operating in intralogistics environments rely on geometric maps for localization and navigation, but lack semantic understanding of objects and their contextual properties. We present a contextual semantic mapping pipeline that combines SLAM-based geometric mapping, SAM-based instance segmentation, instance clustering, and VLM multi-view reasoning to produce a contextual semantic map representation encoding geometric structure, object class, and object movability. By aggregating observations across multiple viewpoints and querying a VLM in a zero-shot, open-vocabulary setting, the pipeline infers contextual object properties--here demonstrated through movability--without requiring task-specific training or predefined object categories. We evaluate three VLMs under two prompting strategies and conduct a component-wise analysis of the pipeline. The proposed pipeline achieves 98.93 % mIoU for semantic classification and 89.17 % mAcc for object movability estimation. Component analysis identifies VLM reasoning as the primary bottleneck for contextual understanding and instance clustering as the main limitation for panoptic performance. The resulting semantic map supports context-aware filtering and robust navigation in dynamic intralogistics environments.
comment: Accepted for publication at IEEE ETFA 2026
Compact Object-Level Representations with Open-Vocabulary Understanding for Indoor Visual Relocalization
Indoor visual relocalization plays a critical role in emerging spatial and embodied AI applications. However, prior research was predominantly devoted to low-level vision schemes, struggling to perceive scene semantics and compositions, which limits both interpretability and applicability. In this paper, we explore the issue of how to organize rich object information in a scene, including semantics, layout, and geometry, into a structured map representation, thereby utilizing object units exclusively to drive the camera relocalization task. To this end, we propose OpenReLoc, a camera relocalization system designed to provide scene understanding and accurate pose estimation capabilities. Leveraging recent foundation models, we first introduce a multi-modal mechanism to integrate open-vocabulary semantic knowledge for effective 2D-3D object matching. Additionally, we design object-oriented reference frames as position priors, paired with a reference frame selection strategy based on the Distance-IoU (DIOU), enabling extension to scalable scenes. Moreover, to ensure stable and accurate pose optimization, we also propose a dual-path 2D Iterative Closest Pixel loss guided by object shape. Experimental results demonstrate that OpenReLoc achieves superior relocalization recall and accuracy across various datasets. Our source code will be released upon acceptance.
comment: Accepted by RA-L 2026
World Value Models for Robotic Manipulation
Generalist value models play a pivotal role in scaling robotic policy learning from large-scale, mixed-quality data. Mathematically, accurate value estimation demands deep temporal understanding, requiring models to both ground the current belief using historical context and plan over future outcomes. However, most existing robotic value models are built on Vision-Language Model (VLM) backbones that are pretrained primarily on static or temporally sparse visual observations, lacking the requisite temporal modeling capabilities for value estimation. Unlike VLMs, world models naturally excel at temporal modeling and future planning, making them ideal foundations for learning generalizable value functions. Driven by this insight, we marry world models with value estimation to construct a new generalist robotic value model, World Value Model (WVM), that offers accurate task progressions to assess data quality. On standard benchmarks, WVM delivers state-of-the-art (SOTA) Value-Order Correlation (VOC) results. Complementing standard evaluation suites that contains only expert data, we further introduce Suboptimal-Value-Bench, a multi-embodiment benchmark consisting of 800 suboptimal trajectories with high-fidelity, human-labeled frame annotations. Our evaluations show that WVM maintains its SOTA performance on Suboptimal-Value-Bench, establishing its robustness in handling both expert and suboptimal data. When deployed for policy learning, WVM improves manipulation performance across various policy extraction approaches in both simulated and real-world deployment, providing robust guidance for learning from mixed-quality data.
comment: preprint
TACTFUL: Tactile-Driven Exploration For Object Localization and Identification in Confined Environments IROS 2026
Humans effortlessly locate and identify objects by touch alone, even without vision. In contrast, robotic systems rely heavily on vision and struggle with autonomous tactile exploration and object identification. We present TACTFUL, a vision-free tactile exploration framework that enables a multi-fingered robot to autonomously explore confined workspaces, discover objects through contact, and identify them via tactile reconstruction. Trained entirely on real hardware without simulation, our system learns a single policy that balances global workspace exploration with local surface refinement through a dynamic reward schedule. Our results demonstrate that tactile sensing, when paired with structured learning, can serve as an effective primary modality for object-level reasoning, achieving 77% success with 0.015 m average reconstruction error and outperforming baseline approaches on real-world objects.
comment: IROS 2026
Beyond Monotonic Progress: Retry-Supervised Value Learning for Robot Imitation
Human demonstrations for robot imitation learning often contain mistakes and corrective behaviors, such as imprecise grasps, object misalignment, unstable contact, and repeated attempts. While these segments are commonly treated as noisy or suboptimal data, they provide valuable evidence about when execution deviates from a desirable path and how task feasibility can be restored. However, existing reward and value models often rely on monotonic progress assumptions, which capture coarse task advancement but may overlook local execution errors and corrective behaviors in imperfect demonstrations. In this work, we propose ReTVL (ReTry-Supervised Value Learning), a framework for learning mistake-sensitive value functions from mixed-quality robot demonstrations by leveraging retry events as sparse supervision. ReTVL captures the local degradation-and-recovery structure around mistakes by combining global progress calibration with local pairwise preference learning induced by sparsely annotated retry keypoints. The learned value model is then used to reweight demonstration chunks for downstream behavior cloning, reducing the influence of harmful execution errors while preserving useful corrective behaviors. Experiments on real-robot manipulation tasks show that ReTVL produces more fine-grained value estimates than progress-based baselines and improves imitation learning from imperfect demonstrations.
Parallel Dynamic Programming for Conic Linear Quadratic Control
Linear Quadratic (LQ) control problems are at the heart of linear control theory and Model Predictive Control (MPC). While performant, standard approaches to solving such problems are inherently serial, limiting real-time scalability despite the parallel computing power available on modern multi-core CPUs. Contributing to addressing this challenge and motivated by ``divide and conquer'' strategies, we present a parallel-in-time approach that solves computationally demanding conic optimal control problems through the use of the alternating direction method of multipliers (ADMM). In particular, we formulate the inner primal update of ADMM as an LQ problem and split the reformulated problem along the time horizon. This enables us to derive a variant of the Riccati recursion using dynamic programming to solve each subproblem in parallel. Numerical benchmarks on two real-world applications demonstrate as much as a 5x speedup compared to existing related approaches on multi-core CPU hardware.
comment: This paper was accepted for presentation at the IFAC World Congress 2026 (IFAC WC 2026)
Optimization-based Safe Trajectory Planning for Autonomous Ground Vehicle in Multi-Floor Scenarios
The development of trajectory planning strategies for autonomous ground vehicles (AGVs) represents a prevailing research interest within the domain of intelligent transportation systems. This paper introduces a trajectory planning framework tailored for multi-floor scenarios. The framework consists of two main modules: the task planning module and the trajectory planning module. The task planning module involves a strategic selection phase, where a task planning strategy based on generalized voronoi diagrams (GVD) and multi-objective algorithms is proposed to select the floor exits for each floor. The trajectory planning module utilizes optimization-based methods to generate high-quality trajectories, and a warm-started hierarchical planning framework is designed to ensure rapid convergence. Additionally, for handling complex obstacle constraints, a correlation constraint calculation method is designed for reducing obstacle constraints in trajectory planning. Finally, the feasibility and effectiveness of the proposed framework are verified through simulations.
ArtiTwinSplat: Interactable Digital Twin Reconstruction via Gaussian Splatting from RGB-D videos ICRA 2026
Deploying robots in unstructured real-world environments needs accurate, interactive models of the objects. Constructing these models at scale remains a critical bottleneck for robotic system integration. We present ArtiTwinSplat, a framework that automatically constructs articulated, photo-realistic digital twins of objects directly from RGB-D videos, requiring no CAD models, simulation assets, or manual annotations. Our method is built on 3D Gaussian Splatting that preserve geometric fidelity and photometric realism, coupled with an unsupervised articulation discovery pipeline that recovers part structure and joint kinematics from observed motion alone. With tracking and optimization stages our method provides stable, queryable digital twins that support real-time rendering, viewpoint control, and interactive manipulation. Unlike prior methods confined to simulation, ArtiTwinSplat operates directly on real-world observations and produces twins that are immediately usable by downstream robot planning and learning systems. This method offers a practical, scalable pathway toward digital twin construction, lowering the integration barrier for articulated object manipulation in embodied AI and human-robot collaboration contexts.
comment: Presented at the ICRA 2026 Workshop on Advances and Challenges in AI-Driven Automation and Robotic System Integration with Digital Twins, Vienna, June 2026
Enabling Robust Cloth Manipulation via Inference-Time Simulator-in-the-Loop Refinement
Simulator-in-the-loop optimization offers a promising inference-time mechanism for robot manipulation. It uses a physical simulator as a backend rollout engine to evaluate candidate trajectories in parallel and refine nominal actions online, a paradigm proven effective in rigid-body manipulation where state and contact are relatively tractable. We bring this paradigm to real-world cloth manipulation from a single RGB input through three pillars. (i) We design a scalable synthetic-data generation and inference-time rollout pipeline built on FLASH, a deformable-object simulator that provides a practical balance among physical fidelity, numerical stability, and rollout efficiency. (ii) We develop a real-to-sim module, trained purely on synthetic data, that maps a single RGB observation to simulation-compatible cloth state by fusing pretrained visual features with learnable canonical tokens. (iii) We perform online planning by coupling a sparse-mesh rollout backend with prior-guided MPPI, anchored at an offline-distilled policy trajectory, preserving manipulation-relevant deformation and contact while enabling sufficient parallel rollout batches. Real-robot experiments show higher success rates and stronger robustness than baseline methods.
Explaining Failures of Cyber-Physical Systems with Actual Causality ICRA 2026
Modern autonomous Cyber-Physical Systems (CPSs), such as self-driving cars, face increasingly complex demands, and yet are expected to act reliably. The black-box nature often characterizing such systems, especially those relying on neural components, makes it impossible to fully verify the system behavior prior to deployment. Unfortunately, unexpected failures-when the system does not comply with its specification-are inevitable and may have catastrophic implications. To improve trust in the system and facilitate future mitigation after a failure occurs, it is important to try to derive an explanation for the unexpected system behavior. This paper introduces the novel concept of leveraging the framework of actual causality for CPS failure explanation. Up until now, this framework was only used to derive explanations in the context of simple systems, such as image classifiers. This paper addresses the theoretical gaps and provides the guidance needed to allow for correct explanation derivation in the CPS domain. Beyond the theoretical contribution, the paper presents two novel, practical, system-agnostic explanation derivation algorithms, allowing to prioritize either explanation optimality or derivation efficiency. The approach is demonstrated and evaluated in the context of a neural-network-controlled autonomous car, designed to avoid collisions.
comment: Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
Decentralized Pose Graph Riemannian Optimization for Object-based Multi-Robot SLAM
Pose graph optimization (PGO) is a key back-end component for state estimation in networked multi-robot simultaneous localization and mapping (SLAM). In object-based multi-robot SLAM, the problem becomes more tightly coupled because robots must jointly estimate both their trajectories and the poses of persistent objects observed by multiple agents. Existing decentralized solutions often assume that the communication graph closely matches the physical interaction topology, which is restrictive in realistic deployments where communication is sparse, intermittent, or time-varying. This paper presents a fully decentralized Riemannian optimization framework for object-based multi-robot PGO that decouples the coupled estimation problem via a consensus mechanism, enabling flexible communication topologies. To improve convergence under limited communication budgets, we further develop a distributed approximate-Newton scheme that exploits local second-order information while operating directly on the SE(d) manifold to preserve geometric consistency, and we establish the convergence to Riemannian first-order stationary points and provide a local condition-number analysis explaining the benefit of approximate second-order information over first-order Riemannian descent. The resulting method reduces iteration count and communication overhead without sacrificing estimation accuracy. Extensive evaluations on public benchmarks, large-scale simulations, and real-world multi-robot experiments demonstrate improved accuracy, runtime efficiency, scalability across network topologies, and robustness to communication failures.
G$^3$VLA: Geometric inductive bias for Vision-Language-Action Models
Vision-language-action (VLA) models have made rapid progress in generalist robot manipulation by harnessing semantic knowledge from pretrained vision-language backbones, but their visual tokens remain grounded in 2D image coordinates rather than the calibrated geometry of the robot's cameras -- a mismatch especially pronounced in multi-camera setups, where views are coupled by known intrinsics and extrinsics yet processed as independent images. We propose G$^3$VLA, a camera-aware geometric module that injects calibrated structure into the visual-token stream of a pretrained VLA without altering its action space or imitation objective, combining intrinsic-conditioned ray embeddings, projective positional encoding (PRoPE), and bidirectional cross-view fusion. Geometric supervision is provided either from ground-truth point maps when available, or from confidence-gated $π^3$X teacher predictions, requiring no depth sensors or manual annotations. Instantiated on $π_0$, G$^3$VLA yields consistent gains across the LIBERO suites, RoboCasa24, RoboTwin2.0, and real-robot settings, with the largest improvements on spatially and object-sensitive tasks. We further validate on $π_{0.5}$ and GR00T 1.5, with results suggesting that geometric transfer is most effective when geometry-aware tokens have direct access to the action generation pathway. Our project page is at https://sites.google.com/view/g3vla
comment: Submitted to CoRL 2026
FT-WBC: Learning Fault-Tolerant Whole-Body Control for Legged Loco-Manipulation
Legged manipulators combine the mobility of legged platforms with the manipulation capability of robotic arms. However, arm-induced Center-of-Mass shifts and dynamic disturbances make the system more prone to instability under actuator failures, potentially leading to falls, task failures, or safety risks. Existing fault-tolerant control methods mainly focus on locomotion alone, leaving the coupled problem of whole-body stability and arm reachability in fault-tolerant loco-manipulation largely unaddressed. To bridge this gap, we propose FT-WBC, a fault-tolerant loco-manipulation framework for robust whole-body control of legged manipulators under actuator failures. FT-WBC adopts a decoupled upper- and lower-body policy architecture and introduces two key modules: a Fault Estimator (FE) and a Posture Adaptation Module (PAM). The FE predicts faulty joints from lower-body proprioceptive histories, while the PAM uses this fault information to adapt the base posture plan generated by the arm policy, converting potentially unstable posture requests into safe and executable base posture commands. Through this fault-aware posture adaptation mechanism, FT-WBC synthesizes compensatory gaits under actuator failures and preserves as much arm workspace as possible while maintaining whole-body stability. Simulation and real-world experiments show that FT-WBC significantly improves survival rate and workspace under weakening or locked failures, and transfers zero-shot to a real legged manipulator in the real world.
Varying Bundle Size Reactive Multi-Task Assignment using Selective Cost Estimation for Multi-Agent Systems
This paper presents a scalable framework for multi-robot task allocation in complex environments where estimating task execution costs is computationally expensive. While combinatorial auction-based approaches offer reliable solutions, the exponential complexity of bundle generation typically renders them intractable for real-time reactive applications, particularly when accurate path planning is required for cost validation. We address this through a distributed, two-stage multi-fidelity bundle generation approach. Agents utilize a local search tree guided by a low-fidelity heuristic (such as euclidean distance) to rapidly explore the bundle space, applying high-fidelity path planning only to the most promising candidates in a best-first manner. These refined bids are then submitted to a central coordinator that solves a set packing problem to ensure global feasibility and maximize the overall utility. Simulation results in multiple environments demonstrate that the framework is able to improve the performance of reactive auction-based task allocation. Overall, the presented framework is shown to enable reactive task allocation with dynamic bundle sizes in multiple settings without exposing the agents' state and internal cost estimation models.
comment: Accepted for publication at the 23rd World Congress of the International Federation of Automatic Control (IFAC 2026)
NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation IROS
Perceiving physical contact is fundamental to dexterous manipulation. While robots often rely on dedicated hardware tactile sensors, humans exhibit a remarkable ability to infer contact by integrating visual information with an innate sense of their body's pose and movement. Inspired by this embodied perceptual skill, we investigate whether a robot can learn to infer contact from vision, an approach that also offers a scalable alternative to tactile hardware specifically for binary contact estimation, which faces practical challenges in cost, fragility, and integration. We present NoContactNoWorries, a transformer-based multimodal framework that fuses RGB-D vision with the robot's proprioception to infer binary contact states as a pseudo-tactile signal for hand-object interactions. We validate by training a single contact prediction model on multiple objects and show that the inferred contact signal supports downstream reinforcement learning agents for in-hand object reorientation, generalizing to novel objects. Experiments in both simulation and on a real-world robot validate our approach, highlighting the feasibility of inferring contact from vision and proprioception. Project Page: https://soham2560.github.io/no-contact-no-worries/
comment: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems(IROS) 2026
Supervise What Survives: Geometry-Guided VLA Adaptation from Synthetic Robot Videos
Vision-Language-Action (VLA) models require large-scale video-action pairs, yet real teleoperation remains scarce. While generated robot videos offer a scalable alternative, existing methods treat them as real robot data by recovering pseudo-actions from synthesized pixels. We argue that deriving low-level control from generated visuals is a mismatched abstraction. A video captures only \emph{geometry}: the spatial trajectory representing the \emph{where} of a task. A real demonstration captures \emph{control}: the exact motor commands representing the \emph{how}. Human-to-robot video generation preserves these unequally: the visible geometry survives the generation process, while the underlying control signals are lost. This \textbf{Asymmetric Preservation Principle} dictates a clean rule: this surviving geometry should solely supervise visual perception, leaving control to real demonstrations. Following this principle, we propose \textbf{GRA} (\textbf{G}eometry-guided \textbf{R}epresentation \textbf{A}lignment), which extracts the geometric content as future 2D end-effector waypoints, computed from the source human video through pose estimation, retargeting, simulation, and calibrated projection, and routes them to the VLA vision backbone via an auxiliary 2D head. The action head is trained on real demonstrations only. During fine-tuning, the waypoint loss persists as a \textbf{spatial representation anchor} that prevents the backbone from losing its geometric grounding. On real-robot tasks, GRA outperforms pseudo-action baselines under matched data budgets and narrows the gap to policies trained with substantially more real demonstrations, suggesting that correctly routed geometry bridges generated videos to robot policies more reliably than recovered actions.
comment: 14 pages, 5 figures
Legible and Intuitive Multi-modal Robot State and Intent Communication Validated in Online and Real-world Studies
Effective robot-to-human communication can increase transparency and trust, reduce uncertainty, and contribute to safer collaboration in shared workspaces. Designing and validating an effective robot communication strategy is challenging due to the varying and often limited communication modalities across robots, differences in how diverse recipients interpret messages, and the underexplored virtual-to-real gap in studies of communication legibility. We present a systematic, large-scale comparative validation of existing communication strategies for a mobile non-humanoid robot across message types and settings (online and in-person). Based on the prescribed message types in the existing standards for industrial robots, we realize and compare a low-expressive, unimodal LED-based strategy with a highly expressive, multimodal one that leverages robotic gaze, gestures, and voice. For each strategy, we analyze the communication of a turning intention, an attention request, error status, whether the robot is stuck, and whether it is functioning normally. We evaluate these strategies in replicated online and in-person experiments. We find strong evidence that highly expressive multimodal communication is perceived as more legible and intuitive than unimodal LED-based communication. Comparing the online and real-world study findings, we observe a notable decrease in overall legibility, particularly for signaling with LEDs. Similarly, confidence in message interpretation decreases during the real-world evaluation.
comment: Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)
RE4: Transformation-aware Imitation of Object Interactions Using Manipulation Modes
Object interaction tasks have been a focus of advances in imitation learning. End-to-end methods, dominated by diffusion and flow-based variants have shown leaps in performance while sacrificing interpretability. Object-centric and pose-informed variants have had a role in learning from demonstration in manipulation tasks. In this paper, we revisit a few modern imitation learning benchmarks for object interactions, with the aim of composing a framework that repurposes principled theories of manipulation, preserving both performance and interpretability. For image observations, lightweight training is proposed for model-free pose estimation of the target object, using self-supervision over the demonstration data available for imitation learning. This information is then used to inform a manipulation mode-aware retrieval of a demonstration, a mode-aware transformation, a replan step that connects to the retrieval point while preserving mode constraints, and finally rolling out the transformed demonstration. These compose four key steps of the proposed RE4 framework, evaluated over state-based and image-based benchmarks in Push-T and Robomimic. An adversarial benchmark that evaluates sparse data regions of image-based Push-T showcases the robustness, further bolstered by indications from low-data regime experiments. The current work shows promise in using simple interpretable building blocks to learn manipulation skills.
comment: 8 pages, appendix
Average Rankings Mask Per-Subject Optimality: A Friedman-Nemenyi Benchmark of EEG Motor-Imagery BCI Decoders
Electroencephalography (EEG) is the dominant non-invasive modality for brain-computer interfaces (BCIs), yet reliable decoding of motor imagery is hampered by inter- and intra-individual variability. A recurring claim is that one decoding pipeline, most often a spatial or Riemannian method, is broadly preferable. We test the weakest version of that claim under the most favourable conditions. Using the Mother of All BCI Benchmarks (MOABB) framework, we evaluated 1,056 decoding configurations (feature extractor x scaler x classifier), >340,000 subject-level model fits, across three public left-versus-right motor-imagery datasets (PhysionetMI, 109 participants; Cho2017, 52; Zhou2016, 4) and two frequency bands (8-15 Hz, 8-30 Hz). Every model is fit and tested within a single session of a single participant, the easiest regime, giving every pipeline its best chance. We apply the statistics standard for multi-classifier comparison: Friedman omnibus tests, Nemenyi critical-difference analysis and Wilcoxon signed-rank tests with effect sizes. Covariance tangent-space projection (cov-tgsp) and Common Spatial Patterns (CSP) are the strongest families, but their ordering is dataset-dependent and, on the largest and most heterogeneous cohort (PhysionetMI), statistically indistinguishable (Nemenyi p = 0.27; Kendall's W = 0.11). At the individual level the single best pipeline is optimal for only 35% of PhysionetMI participants, and nonlinear descriptors are best for roughly one third; matching pipeline to participant adds about seven accuracy points over the best fixed choice. The ranking is not an artefact of dimensionality, and classifier and scaler choices are secondary to the feature representation. Even in the easiest regime, no single pipeline dominates: a lower bound on the personalization problem and a quantitative case for participant-aware model selection rather than a universal decoder.
comment: 16 pages, 6 figures, 4 tables
PDS Joint: A Parametric Double-Spiral Joint Tailored for Dexterous Hands
Compliant joints can embed safety and adaptability into dexterous hands, but achieving large-stroke anthropomorphic motion while maintaining joint-specific, directiondependent stiffness and reliable proprioception remains challenging. This paper presents the PDS joint, a parametric doublespiral (PDS) compliant joint that enables systematic shaping of directional stiffness across multiple deformation modes, including flexion/extension, abduction/adduction, and pronation/supination. We instantiate the joint using Archimedean and logarithmic spiral templates for different hand joints and introduce an asymmetry ratio to tailor stiffness distributions for both grasp stability and hyperextension resistance. To make the joint practically usable under large deformation, we co-design embedded inductive proprioception and propose a learningbased calibration pipeline that maps raw inductive signals to joint states using ArUco-marker tracking. Experiments characterize the stiffness landscapes across geometric parameters and demonstrate a non-monotonic dependence of lateral support on asymmetry, indicating the importance of principled parameter tuning. For joint-state estimation in the most challenging abduction/adduction motion, a learned multilayer-perceptron (MLP) mapping reduces the error compared with conventional curve fitting by 41.6%. Finally, we integrate the proposed joints into an open-source dexterous hand as a demonstration platform, on which the hand grasps a set of nine everyday objects and performs safe, contact-rich human-involved interactions.
SlipSense: Multimodal Sensing for Online Slip Detection in Legged Robots
Legged robots rely on accurate ground interaction awareness to traverse variable terrains, such as slippery surfaces. Existing slip detection methods often rely on kinematics and proprioception, which lack the sensitivity to detect early-stage slips that occur prior to catastrophic instability. Thus, this paper presents SlipSense, a novel framework for online force-based slip detection using a custom lightweight sensorized foot for quadrupeds to detect slip. The framework integrates a multimodal sensor design with a LSTM-based model to infer ground reaction forces and detect slip-indicative anomalies during locomotion. The proposed framework is deployed on a Unitree Go1 quadruped to demonstrate blind online slip detection over a slippery terrain. Our method detects early-stage slips down to an average displacement of 24.1 +/-6.4mm with an overall accuracy of 85.9%. This represents a 3.3-fold finer detection resolution and a 24% relative accuracy improvement over a standard kinematic baseline that uses foot velocity inferred through state estimation. The work in this paper serves as a foundation for force-aware gait adaptation in legged robotic locomotion, allowing future controllers to estimate terrain friction and adjust constraints, thus improving the overall stability of the system.
RoBoSR: Structured Scene Representations for Embodied Robotic Reasoning
Despite rapid progress, embodied reasoning under real-world variability remains challenging. Existing approaches rely on demonstration-driven sequential biases, limiting flexibility in open-ended and long-horizon tasks that require structured reasoning over evolving states. We introduce RoBoSR, an intermediate structural representation that formulates manipulation as step-wise state transitions over semantically grounded, object-centric scene graphs. By modeling object states and their spatial relations at the perception-action interface, RoBoSR disentangles high-level task reasoning from raw inputs and enables structured reasoning over preconditions, effects, and goal states. This representation endows the agent with causal reasoning capability, enforcing subtask dependencies and supporting coherent long-horizon task planning. To learn such structure-aware reasoning, we construct Manip-Cognition-1.6M, an open-world dataset that jointly supervises scene understanding, instruction interpretation, and subtask planning across diverse tasks. Across several benchmarks and real-world demonstrations, our method consistently outperforms prompting-based methods and classical TAMP baselines in zero-shot generalization and long-horizon tasks. The results underscore structured intermediate representations as a critical inductive bias for scalable embodied reasoning.
From Open Waters to Enclosed Cabins: ProteusVPR for Cross-Scene Visual Place Recognition in Maritime Perception and Cabin Inspection
Autonomous robotic inspection in maritime environments presents unique challenges for Visual Place Recognition (VPR) due to cross-scene perceptual shifts. Robots navigating ship-borne environments must transition between visually distinct domains: open decks with sparse textures and severe illumination changes, and enclosed cabins with repetitive structures and high visual ambiguity. Existing VPR methods, designed primarily for urban or indoor scenes, fail to generalize reliably across these starkly different scenarios. To address this, we propose ProteusVPR, a two-stage retrieval-refinement framework. The first stage employs any standard VPR model for initial image retrieval. The second stage introduces a geometric-visual estimation network that fuses the retrieved image with two temporally preceding frames, incorporating geometric descriptors, a local affine coordinate system, and camera azimuth encoding to achieve precise localization. To support this task, we introduce the XHZ dataset, an 8K-panoramic ship-borne dataset collected from an operational vessel, featuring multi-floor cabin structures, deck transition zones, and strict query-database separation for rigorous evaluation. Extensive experiments on the XHZ dataset demonstrate that ProteusVPR consistently improves the localization accuracy across multiple VPR backbones, reducing mean localization error by over 60\% on average and that ProteusVPR offers an effective and robust solution for precise visual localization in challenging, cross-scene maritime environments.
Grounding Generative Policies in Physics: Optimization-Guided Diffusion for Robot Control
Diffusion models sample effectively from high-dimensional, multimodal distributions, but their outputs may violate deployment constraints. For task-space robot policies, generated grasps, waypoints, or trajectories can be distributionally valid yet infeasible, violating reachability, collision-avoidance, or closed-loop executability requirements. This embodiment gap limits zero-shot deployment across robots, even when the task-space behavior itself is transferable. We propose an inference-time optimization framework that couples the behavior generation to physical feasibility by formulating diffusion guidance as a constrained optimization problem. Our key insight is to replace the sampling perturbation in the backward process with an optimized correction, allowing hard constraints or soft penalties to be imposed during sampling without the need to retrain the diffusion model, while keeping samples close to the learned prior. We evaluate the method on dexterous grasp synthesis with reachability and collision-avoidance constraints, and dynamic manipulation with controller-level trackability constraints. Across settings and robot embodiments, optimization-guided denoising matches the feasibility of projection- and gradient-guidance baselines while better preserving grasp quality, and improving controller-level executability and task success, with task success improving by up to 20pp. on dexterous grasping and 23pp. on visuomotor manipulation over the best baseline.
Importance of Intent-Sharing for V2X-based Maneuver Coordination
This paper examines the critical role of intent-sharing in enabling effective maneuver coordination for connected and automated vehicles (CAVs). Successful maneuver coordinations require vehicles to accurately know other vehicles' driving intentions. Intent-sharing can be achieved by the remote vehicles directly communicating their plans with the ego vehicle, as opposed to the ego vehicle predicting the trajectory on the remote vehicles' behalf. In this paper, we investigate the potential of intent-sharing on maneuver coordination effectiveness by quantifying the percentage of successful coordinations. We analyze the potential of intent-sharing by comparing its effectiveness for coordinated lane changes in a highway scenario with the effectiveness of a trajectory prediction method based on current kinematic data. Our analysis demonstrates in two scenarios substantial improvements in maneuver coordination when CAVs have direct access to the nearby vehicles' driving intentions through intent sharing. These findings highlight the importance of including intent-sharing in the maneuver coordination protocol.
The Evaluation Cost of Task Specialization in Evolutionary Multi-Robot Systems GECCO '26
Task specialization can improve the efficiency of multi-robot systems (MRSs). Previous works have investigated the emergence of task-specialist robot controllers through evolutionary optimization and have argued that task specialization is more likely to evolve when subtask behaviors are readily available as building blocks. However, the available evaluation budget must be distributed across all subtasks, whereas a single generalist behavior can exploit the entire budget for its own optimization. We present a cost-benefit analysis of evolving task-specialist versus generalist behaviors in a foraging scenario here. In a physics-based robotics simulator, we study the total evaluation budget required to evolve task-specialist behaviors that outperform generalist behaviors across MRS sizes. We show that with increasing MRS size, a lower total evaluation budget is sufficient to evolve specialists that outperform generalists.
comment: Accepted for publication at GECCO '26 Companion: Proceedings of the Genetic and Evolutionary Computation Conference Companion. Supplementary video: https://youtu.be/U5LNICts7Ek
Efficient Time-Domain Simulation of USV Motions in Short-Crested Irregular Waves Using an IRF-Based Framework
Traditional time-domain prediction of vessel motions in irregular waves usually relies on superposing responses from many regular-wave components, which is computationally expensive for long-duration simulation and real-time applications. This issue is particularly relevant to unmanned surface vehicles (USVs), for which efficient and realistic motion prediction is needed for seakeeping assessment, simulation-based testing, and control-system development. This study applies an impulse response function (IRF)-based time-domain framework to predict vessel motions in short-crested irregular waves. Froude-Krylov, diffraction, and radiation loads are obtained from frequency-domain analysis and transformed into the time domain. Instantaneous responses are then evaluated directly through convolution-based force reconstruction, reducing the need for repeated regular-wave simulations. Weak nonlinear restoring effects are included by instantaneous wetted-surface pressure integration, and directional wave spectra are used to represent realistic sea states. The framework is validated against model-test measurements of an offshore supply vessel in long-crested beam irregular waves and full-scale measurements of a USV operating in real sea conditions. Predicted significant amplitudes, mean zero-crossing periods, standard deviations, and motion time histories agree well with measurements. The effect of directional-spectrum discretization is also examined. Results show that motion amplitudes are moderately sensitive to directional resolution, whereas motion periods are relatively insensitive. A 30 deg directional interval provides a practical balance between prediction accuracy and computational cost. The proposed framework offers an efficient tool for high-fidelity time-domain prediction of USV motions in realistic directional irregular seas.
NavWM: A Unified Navigation World Model for Foresight-Driven Planning ECCV 2026
Conventional visual navigation policies often struggle with myopic decision-making and mode collapse in complex environments. While world models offer a promising alternative, existing paradigms typically isolate perception, generation, and control, failing to capture their shared spatio-temporal dynamics. In this paper, we propose NavWM, a unified navigation world model that seamlessly integrates latent world reasoning, multimodal action prediction, and controllable visual generation. At its core, NavWM leverages latent world tokens to distill geometric and semantic priors, endowing the agent with robust structural understanding. To overcome the limitations of deterministic policies, we introduce an anchor-based multimodal trajectory forecasting framework that generates a diverse action space. This inherent diversity explicitly empowers the generative world model to act as a robust closed-loop planner, utilizing visual foresight to evaluate and select the optimal path. Extensive experiments across diverse robotics datasets demonstrate that NavWM significantly advances the state-of-the-art, delivering remarkable improvements in both high-fidelity future state generation and zero-shot navigation success.
comment: 13 pages, 5 figures, accepted to ECCV 2026
DynaWM: Dynamics-Aware Distillation with World Model and Momentum Targets for Smooth Locomotion over Continuous Stairs IROS
Recent advances in control have enabled bipedal-wheeled robots to traverse slopes and single-step obstacles, yet long staircase traversal remains challenging as current teacher-student frameworks suffer from weakened dynamics-aware representations and incomplete terrain geometry encoding. To bridge this gap, we propose DynaWM, a dynamics-aware representation learning framework. To enhance terrain encoding capability and enable transparent assessment, we introduce a world model as a regularizer to enforce forward-dynamics awareness, preserving comprehensive terrain geometry while facilitating hierarchical encoding visualization. To stabilize knowledge transfer, we employ a momentum target encoder to provide consistent distillation targets, preventing dimensional collapse from non-stationary teacher updates. Evaluation of the learned representations through Principal Component Analysis (PCA) visualization and quantitative metrics reveals that our encoder hierarchically captures terrain geometry with higher terrain encoding capability, leading to enhanced terrain adaptability and motion smoothness. Experimental results in simulation and real hardware demonstrate that our method achieves superior terrain adaptability and motion smoothness, enabling bipedal-wheeled robots to overcome diverse continuous stairs, as shown in Fig. 1.
comment: Comments: 8 pages, 7 figures, accepted by IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
MinInter: Minimizing Trajectory Interpolation During Data Augmentation for Imitation Learning
Imitation learning enables robots to acquire complex manipulation skills from demonstrations, but its effectiveness is limited by the cost of collecting high-quality data. Trajectory-level data augmentation methods alleviate this challenge by recombining expert demonstrations under varied initial states. However, such methods typically insert interpolations or other non-expert transition segments between disjoint parts, and such non-expert segments could reduce the quality of the generated data. This paper introduces Minimizing Interpolation (MinInter), an effective trajectory selection method that, for each sampled initial configuration, chooses the source demonstration requiring the least interpolation to form a complete trajectory. By explicitly minimizing interpolations during data generation, MinInter produces higher-quality synthetic demonstrations while remaining compatible with existing data generation frameworks. Experiments on 12 manipulation tasks with 26 variants from the MimicGen benchmark show that MinInter consistently improves both data generation success rates and policy success rates, with the largest gains on contact-rich, long-horizon and high-variance settings. Compared to the recent SkillGen framework, MinInter achieves higher policy success rates despite its conceptual simplicity, underscoring the value of interpolation minimization for data augmentation.
comment: Accepted by IEEE CASE 2026
ObsGraph: Hierarchical Observation Representation for Embodied Reasoning and Exploration
Embodied reasoning and exploration are increasingly considered crucial abilities for robots operating in complex and unfamiliar environments. To accomplish tasks in such settings, an agent must identify and acquire the information necessary for the task through exploration. We propose ObsGraph, an observation-centric hierarchical scene graph that unifies scene representation, retrieval, and exploration. It retains visual evidence and organizes it into room-view-object layers: rooms provide coarse semantic anchors, views preserve contextual object covisibility, and objects store fine-grained details. On top of this representation, we perform coarse-to-fine hierarchical retrieval under a bounded budget, and crucially use retrieval outcomes to structure the exploration candidate space--activating room-level exploration, view refinement, or frontier exploration--thereby tightly coupling representation, retrieval, and adaptive multi-scale exploration. Experiments across embodied reasoning and exploration benchmarks demonstrate improved success and efficiency, highlighting the benefits of structured scene representation and more targeted information gathering driven by identified evidence gaps.
SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies
In robot learning, scaling training datasets across diverse embodiments and environments has become a dominant paradigm for learning generalizable robot policies. These policies are commonly trained via behavior cloning to imitate actions from pre-collected demonstrations. However, since robot actions are tied to the dynamics of the data collection robot, different robots may require different actions to achieve the same motion. This discrepancy hinders both policy training and deployment across diverse robots. To address this, we propose using Cartesian state delta as a universal action representation across robots, and introduce State Prediction and Adaptive Command Execution (SPACE) framework. SPACE handles robot dynamics variation at three levels: across different embodiments, across hardware units of the same embodiment, and within a single robot during operation. It consists of two components: (i) a Cartesian state delta policy that predicts geometric end-effector displacement, and (ii) Action Adapter, which converts the predicted Cartesian state delta into robot-specific control commands. Experiments show that SPACE substantially outperforms policies that directly predict control commands when learning from data collected across different embodiments and across hardware units of the same embodiment. SPACE also remains robust under dynamics shifts at deployment, including changes in control frequency, object weight, and controller gains. The project page is available at http://haeone.site/space-website/.
comment: Project page: http://haeone.site/space-website
TurboMPC: Fast, Scalable, and Differentiable Model Predictive Control on the GPU
Robotics increasingly relies on GPUs for parallel simulation, large-scale learning, and neural-network inference. For model predictive control (MPC) to scale with this paradigm, solvers must run efficiently on this hardware while remaining fast, differentiable, and compatible with expressive MPC formulations used in robotics. We present TurboMPC, a differentiable MPC solver that runs entirely on the GPU and supports state and control inequality constraints, implicit integrators, cross-time-coupled costs, and slack variables. TurboMPC combines sequential quadratic programming (SQP), an alternating direction method of multipliers (ADMM) inner solver, implicit differentiation, and a co-designed JAX-CUDA implementation for efficiency and ease of use. In simulation, we validate TurboMPC on constrained planning, humanoid imitation learning, and reinforcement learning with neural-network cost function tasks, achieving up to $15\times$ and $58\times$ speedups over state-of-the-art CPU and GPU differentiable solvers, respectively. We deploy TurboMPC on a full-scale car for minimum-time racing and find that batched, GPU-accelerated tuning of MPC parameters via Bayesian optimization yields significantly faster driving than a hand-tuned baseline. TurboMPC also scales to planning horizons of over $8000$ knot points while maintaining control of the vehicle. We open-source TurboMPC at: https://github.com/ToyotaResearchInstitute/turbompc
Sim-to-Real Betting on the E-Process: Bringing "simulators" to anytime-valid confidence sequences
This note describes an integration of the sim-to-real performance estimate with betting (from Chen et al.) and the safe anytime-valid inference (from Ramdas et al.). Using the scaled simulators. The method produces efficient, reliable certificates for the mean estimate, an approach that is especially valuable in robot performance testing. This note gives a primary, self-contained account of the construction; preliminaries of the respective methods are kept at a minimum, and one shall refer to the original works for full detail. Some synthetic examples demonstrating the proposed algorithm can be found at https://github.com/ISUSAIL/Bet4Sim2Real-EProcess.
comment: Affiliated open source code at: https://github.com/ISUSAIL/Bet4Sim2Real-EProcess
GRAFT: Graph-Based Affordance Transfer via Part Correspondence
Generalizing robotic manipulation to unseen objects remains challenging, as learning-based approaches require many demonstrations and fail in few-shot settings. Prior work transfers affordances through semantic retrieval, but semantics alone neglect geometric similarity, which is critical for manipulation. We propose GRAFT, a geometry-aware correspondence framework for zero-shot manipulation transfer using only one demonstration per object. Objects are represented as part-based graphs, where part-level descriptors support global instance retrieval and part correspondence, and vertex-level descriptors enable fine-grained contact point matching. For an unseen object, our method first retrieves the most functionally and geometrically similar instance from the demonstration buffer with aligned functional parts, and finally propagates the contact points through point-wise correspondence.
Spatio-Temporal Retrieval-based Priors for Adaptive Computational Teaching in Driving
Learning-based automated coaching systems for complex motor tasks such as high-performance driving remain limited in the ability to be adaptive by their reliance only on local, context-dependent reasoning, failing to account for the long-term temporal nature of student learning and the cumulative impact of repeated teacher-student interactions. In this paper, we propose an imitation learning based computational model for adaptive teaching with a dedicated temporal reasoning module that can reason over the interaction history under low-data regimes. To compensate for limited amounts of interactive training data, and based on the repetitive nature of the teaching process, the model relies on a nearest neighbor retrieval and cross attention prior, reasoning only on a narrowed-down set of semantically similar past interactions with an encoder-decoder based concurrent teaching model. We validate our approach with (i) a novel semi-synthetic closed-loop longitudinal student-teacher interaction dataset based on Waymo Open Motion Dataset and (ii) a small-scale real-world naturalistic simulator race coaching dataset. Our results reveal the consistent advantage of our adaptive teaching model with the nearest neighbor retrieval and cross-attention prior over a non-adaptive baseline as well as a suite of adaptive models that differ in their choice of priors and temporal fusion mechanisms.
comment: 20 pages, 8 figures
Swazure: Swarm Measurement of Pose for Flying Light Specks
One may construct a 3D multimedia display using miniature drones configured with light sources, Flying Light Specks (FLSs). Swarms of FLSs localize to illuminate complex 3D shapes and animated sequences consistent with the coordinates of points in a point cloud. This requires FLSs to accurately measure their pose relative to one another using sensors such as cameras. Such sensors have a sweet range in which they provide the highest accuracy. A challenge is how an FLS tracks another FLS outside its sensor's sweet range, dictated by the point cloud data. We address this challenge by proposing a novel technique called Swazure that solves the missing sensor data using cooperation among FLSs. It implements physical data independence by abstracting the physical characteristics of the sensors, making point cloud data independent of the sensor hardware. The size of an FLS relative to the minimum distance between points of a point cloud is an important parameter. With medium-sized FLSs, Swazure is able to position 100% of the FLS's neighbors. Larger FLS sizes may result in potential obstructions that prevent Swazure from quantifying relative pose. We present two heuristics, Move Obstructing and Move Source, to address this limitation. Our experimental results show the superiority of the Move Obstructing heuristic which resolves approximately 30% of obstructions in the worst case scenario.
comment: Appeared in Proceedings of the Holodecks Foundation, Vol. 2, No. 3
Reflective VLA: In-Context Action Consequences Make VLAs Generalize
Most vision-language-action (VLA) models are reactive: they predict the next action from the current instruction and observation, implicitly assuming that the current observation fully specifies the action-relevant state. In embodied control, however, embodiment-specific factors such as camera-to-robot geometry, robot calibration, or systematic actuation bias are often hard to identify from a single observation. As a result, reactive policies cannot reliably disambiguate these factors in general, overfitting to training environments and generalizing poorly at deployment. We propose Reflective VLA, which conditions each decision on a context of observation-action-consequence triplets. Each triplet records not only what the robot observed and executed, but also how the scene changed afterward, exposing the deployment-specific mapping from actions to observed effects. Architecturally, Reflective VLA routes all observation modalities through the VLM under shared attention, so the action expert reasons directly over past triplets and the current observation. A block-causal mask enables parallel multi-frame training without leakage and supports KV-cached real-time inference. On standard LIBERO and SimplerEnv-Bridge, Reflective VLA preserves strong in-distribution performance. Under distribution shift on LIBERO-Plus and the harder LIBERO-Plus-Hard, it improves average success rate by 5.4 and 4.2 percentage points over a matched reactive baseline. Ablations with a matched history-only baseline further show that action consequences -- rather than additional context length alone -- are the key to cross-environment generalization. Project page: https://lianqing11.github.io/reflective-vla-page/
RigPI: Dynamic Parameter Identification of Rigid Body via VLM-Seeded Differentiable Simulation IROS 2027
Accurate physical parameter identification of manipulated objects is fundamental to advanced robotic manipulation and the construction of faithful digital twins. However, acquiring physically consistent inertial and frictional properties from real-world interactions remains challenging due to sensing noise, modeling errors, and limited prior knowledge. This paper presents \textbf{RigPI}, a systematic framework for identifying dynamic parameters of both unconstrained rigid bodies and multi-link rigid bodies during robot-object interaction. RigPI integrates vision-based semantic priors, force-torque measurements, and motion observations within a differentiable simulation pipeline. A vision-language model (VLM) provides informed initialization and a constrained search space, while gradient information from a differentiable physics simulator enables efficient and stable parameter refinement. The proposed two-stage optimization strategy alleviates sensitivity to noise and avoids physically implausible solutions. Extensive real-world experiments on objects with revolute and prismatic joints demonstrate that RigPI achieves accurate and stable parameter estimates, and successfully reproduces manipulation trajectories on a real robot with parameter-aware predictive validity. These results highlight the effectiveness and robustness of RigPI for real-world robotic system identification tasks.
comment: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2027)
Wear-Clearance-Impact Coupling in the Jansen Linkage: A Gait-Durability-Optimized Design Slows Joint Loosening
A companion study introduced joint durability into the dimensional design of the Theo Jansen walking linkage and found its classical "holy numbers" Pareto-dominated, but it modelled the revolute joints as ideal, clearance-free pins, so its wear figures were relative rankings, not a prediction of in-service degradation. Here we relax that idealization. We build a forward-dynamic model of the Jansen leg in which a revolute joint becomes a clearance joint with a continuous normal contact law (Lankarani-Flores, hysteresis-damped) and Ambrosio friction, integrated as a constraint-stabilized differential-algebraic system, and couple it to the Archard law in a wear->clearance->impact feedback loop. Three findings emerge. First, neglecting clearance underestimates the peak joint load: the clearance model gives a peak contact force of ~104 N at the load-bearing pin against ~48 N for the ideal joint (~2x amplification), rising to ~426 N when two joints carry clearance at once. Second, the coupling is strongly impact-sensitive--single trajectories are non-monotonic and can reverse the design ranking, a chaos consistent with the literature--so designs must be compared statistically; over an ensemble of 16 randomized phases the optimized joint is robustly more durable, with per-cycle wear ~9-7x lower (peak force ~4x lower) at one clearance joint and still ~1.7x lower on both with two (p<0.01 throughout). Third, the wear is strongly non-uniform--it concentrates on a ~10 deg load arc--so assuming uniform clearance growth underestimates local clearance growth by ~36x. The clearance-free durability advantage thus survives the chaotic, multi-joint, non-uniformly-worn coupling in the ensemble mean. We deliver the first clearance-coupled forward-dynamic model of the Jansen leg and specify a falsifiable protocol to test each prediction.
comment: 10 pages, 9 figures. Companion to arXiv:2606.22129
RAVEN: Long-Horizon Reasoning & Navigation with a Visuo-Spatio-Temporal Memory
Long-term robot deployment requires a compact and scalable memory that preserves fine-grained visual semantics, grounds observations in space and time, and enables efficient storage and retrieval. In this paper, we propose RAVEN, an agentic memory system for long-horizon robotic question answering and navigation. RAVEN stores visual embeddings with pose and time in a vector database, and grounds retrieval in a spatial map to answer queries and navigate to goals. By operating directly on visual embeddings, RAVEN avoids lossy image-to-text captioning and enables accurate semantic, spatial, and temporal retrieval at scale. Across several simulated and real-world video question-answering benchmarks, RAVEN consistently surpasses caption-based memory systems and matches frontier VLMs on long-horizon tasks at 10$\times$ lower retrieval cost. Finally, we instantiate RAVEN on a Unitree Go1 robot for the task of long-horizon navigation for natural language goal-reaching, and show successful deployment over several large indoor environments.
comment: Project website: https://ravenmem.github.io/
ARTOO-DARTU: Studying AR-HRC With AR Obstruction Mitigation During a Warehouse Task
Human-robot collaboration (HRC) often requires robot intentions and internal states to be conveyed to users for task efficiency and safety. Recently, augmented reality (AR) situated analytics provide such real-time robot feedback in HRC contexts. However, AR situated analytics can obstruct important real-world elements, posing safety and usability risks, especially when content is dynamically positioned relative to movements of mobile robots in a warehouse HRC scenario. In this paper, we introduce the Augmented Reality Technique Of Obstruction Deterrence while Aiding Robotic Teaming for Users (ARTOO-DARTU), an AR system tailored specifically for warehouse HRC that enables real-time robot situated analytics and control while preserving visibility of the real world through an obstruction detection and mitigation pipeline (ODM) that is uniquely suited for AR-HRC. To evaluate ARTOO-DARTU, we developed Pocket MonstARs, a controlled gamified abstraction of HRC warehouse inventory picking in which virtual monsters serve as proxies for pick targets, while labeled and object-marked boxes preserve the real-world identification demands of the picking task. In a 34-participant user study, we found that our designed AR situated analytics yielded a 46% increase in efficiency on the overall HRC task, but only when the ODM was active. Participants with the ODM active were also 61% faster on subtasks requiring visibility of the real world. Our findings demonstrate that, when paired with our developed ODM to prevent real-world obstructions, the situated analytics in ARTOO-DARTU can significantly enhance efficiency and user experience in AR-HRC warehouse scenarios.
comment: To appear in Proceedings of the ACM on Human-Computer Interaction, Vol. 10, No. 5, Article MHCI7668, MobileHCI 2026
Learning Perceptive Platform Adaptive Locomotion Controllers for Quadrupedal Robots
Universal quadrupedal locomotion remains limited by the difficulty of integrating perception across diverse robot morphologies. State-of-the-art controllers rely on single-robot training or blind policies that omit real-time perception, leading to poor cross-embodiment generalization. Designing locomotion policies that remain robust across related quadruped morphologies while incorporating perception is challenging. Moreover, fully perceptive policies are often sensitive to noise, whereas blind controllers lack terrain awareness. In this work, we study how perception should be integrated into morphology-aware reinforcement learning architectures for deployable quadrupedal control. Building on MorAL, we train morphology-specialized universal controllers on multiple reference quadrupeds using adaptive terrain curricula. We compare a blind baseline, a critic-perceptive variant (MorAL+), and a fully perceptive actor-critic (PPAL). Policies are evaluated in simulation on flat and rough terrains, and deployed on ANYmal hardware. Results show that critic-only perception improves robustness and tracking consistency over blind baselines while remaining more stable than fully perceptive policies under perception noise. These findings highlight that perception placement and curriculum design are key factors for scalable, morphology-aware locomotion.
fARfetch: Enabling Collocated AR-HRC in Large Visually Diverse Environments with VLM-Driven AR Content Adaptation
Augmented Reality (AR) can improve collocated human-robot collaboration by making robot state and intent visible and enabling intuitive control, yet large, visually diverse environments like the outdoors challenge both interaction and content legibility, especially at long distances and beyond visual line of sight. We present fARfetch, an AR-HRC system that integrates (i) shared semantic environment mapping across an AR headset and robot that visualizes detected landmarks in AR to support landmark-grounded go-to commands, (ii) a context-aware world-in-miniature representation of the shared environment for fine-grained path authoring, and (iii) vision-language-model driven AR view management that jointly adapts virtual content color, size, and orientation to maintain legibility in large visually diverse environments. We implement fARfetch with a Meta Quest 3 headset and Unitree Go2 quadruped robot, and conduct a within-subjects user study (N=13) on a real-world large-scale (30.5m) outdoor inspection task. fARfetch yielded significantly faster completion times than a non-AR baseline (66%) and significantly lower workload in mental demand (-43%), temporal demand (-34%), and frustration (-66%). A custom legibility survey indicated fARfetch effectively maintained virtual content legibility in the large outdoor environment.
comment: Accepted to the 2026 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). Author accepted manuscript
Toward Low-Latency Vision-Language Models with Doubly-Correct Predictions in Egocentric Visual Understanding IROS
The rapid rise of Vision-Language Models (VLMs) in egocentric visual understanding has made low-latency inference in human-robot collaborative (HRC) tasks increasingly critical. Weight pruning techniques developed for VLMs to shrink model size and computation can be readily applied to satisfy the efficiency demands of on-board processing and real-time interactive robotics. Moreover, safe human-robot interaction demands pruning strategies that preserve doubly-correct predictions; outputs must be both accurate and evidentially grounded to mitigate risks and ensure user trust. In this paper, we present a new study of VLM pruning through the lens of doubly-correct prediction. Our experiments surprisingly show that existing pruning methods often preserve the right evidence localization but undermine correct prediction. To address this, we propose a rationale-informed pruning strategy that better aligns evidence with decisions. Benchmark results on egocentric video datasets demonstrate that our method not only achieves the highest prediction accuracy but also outperforms existing approaches in attaining doubly-correct predictions. We aim to stimulate research on efficient and reliable VLMs, ensuring accuracy-driven advances align with the transparency, auditability, and safety required for responsible human-robot interaction and embodied intelligence.
comment: International Conference on Intelligent Robots and Systems (IROS) 2026
SwarmFly: A simulation platform for UAV swarm experiment design and validation
The initial development phase of UAV swarms largely depends on simulation for experimental design and validation, yet existing open-source tools are often unmaintained, have steep learning curves, or are built around a single fixed scenario. The need for a comprehensive, modular simulation platform is a recognized research gap. This paper presents SwarmFly, a MATLAB-based simulation and test platform for multi- UAV swarms that addresses these gaps. SwarmFly combines a real-time operational map, four swarm coordination modes (leader-follower, decentralized, heterogeneous relay, and heterogeneous speed), simulated IMU telemetry, and IP-based geolocation with a plugin architecture that lets researchers add behaviors, fault models, and analysis tools without touching the core code. Eight bundled plugins extend the base simulator into a full test harness. The SwarmFly platform exposes multi-agent aerial swarms to a wide range of internal and external disruptions, enabling observation and quantification of underlying swarm control and behavioral mechanisms. This study verifies and characterizes each subsystem through eight experiments that measure formation accuracy, wind tolerance, fault recovery, energy endurance, and airspace compliance. The platform runs entirely in MATLAB. Its modular design supports straightforward extension toward hardware-in-the-loop testing, larger swarms, and higher-fidelity dynamics. An open-source release is available at [https://github.com/abhishekphadke/SwarmFly.git]
Memory Retrieval in Visuomotor Policies for Long-Horizon Robot Control
General-purpose robots operating in partially observable environments, such as homes, require memory to support autonomy. They must recall diverse information from the past, such as where objects were placed, which tasks a human partner has completed, and when an appliance was turned on. Achieving this versatility requires a general memory retrieval mechanism. Transformer architectures that use attention over long contexts for memory retrieval provide a promising approach, as they learn retrieval from data rather than relying on task-specific or hand-designed rules. However, directly incorporating them into imitation learning from offline data introduces two key challenges: (1) the policy may learn spurious correlations between past information and predicted actions, and (2) errors accumulate in memory due to prediction inaccuracies and their compounding interactions with the environment, causing model drift and cascading failures. To address both challenges, we introduce HALO, a visuomotor policy with an attention-based memory retrieval mechanism for long-horizon control. First, to suppress spurious correlations, HALO distills vision-language model (VLM) priors into the policy. It generates memory-dependent question--answer pairs from demonstration trajectories and trains jointly with a video question--answering objective, steering retrieval toward task-relevant information. Second, to reduce the impact of accumulated errors in memory during closed-loop control, HALO uses sparse attention that restricts retrieval to only the most relevant parts of the history. Together, these components enable more reliable long-horizon control by guiding the policy to retrieve task-relevant information from up to eight minutes of past experience. Project website: https://robin-lab.cs.utexas.edu/HALO
comment: 16 pages, 5 tables, 8 figures
Causality-Based Parametric Control Barrier Function for Safe Multi-Vehicle Interaction ICRA 2026
Safe control has been widely studied in various safety-critical applications, for instance, autonomous driving. In order to ensure the autonomous vehicle does not collide with other vehicles, it is essential to obtain an accurate expectation of surrounding vehicles' behavior and react adaptively. Instead of assuming fully cooperative and homogeneous vehicles using the same safety-critical controllers, recent works have been exploring different data-driven approaches to model the neighboring vehicles' underlying controllers with observed data. However, existing works either suffer from 1) the inter-vehicle influence during the multi-vehicle interaction, which makes it hard to determine the causality of surrounding vehicles' behavior in controller modeling, or 2) being dominated by the worst-case analysis, which may lead to overly conservative behavior. In this paper, we extend the prior work on Parametric-Control Barrier Function (Parametric-CBF) to multi-robot interactions with embedded causality inference to explicitly reason over the inter-vehicle influence. Given the learned Causality-based Parametric-CBF, we present an adaptive safety-critical controller that allows the ego vehicle to safely react to surrounding vehicles with the learned expectation. We demonstrate that by leveraging the motion flexibility among multi-vehicle systems, task efficiency can be greatly improved in various interaction-intensive scenarios.
comment: accepted ICRA 2026
RGB: RL Guided Whole-Body MPPI for Humanoid Control
Humanoid robots require whole-body controllers that are both robust and precise in contact-rich environments. While deep reinforcement learning (RL) achieves robust stability, its behavior is tightly coupled to the training objective and command interface, making it difficult to add new feedback objectives without retraining. In this study, we propose an RL guided whole-body model predictive path integral (MPPI) framework that acts as an add-on feedback controller on top of a pretrained RL policy. Instead of using RL policy as the final controller, we use it as a sampling prior that biases MPPI rollouts toward dynamically feasible behaviors. Task objectives are specified through modular MPPI cost terms, and MPPI closes the loop by continuously correcting the RL prior online to satisfy these objectives without retraining the policy. Simulations on a 29-DoF Unitree G1 humanoid in MuJoCo demonstrate stable high-rate control (average 280~Hz). The proposed method improves task-level precision over a pure RL baseline under the same command interface. This is achieved by correcting systematic drift during straight walking and tracking additional whole-body reference signals imposed through the cost.
comment: 7pages
AeroCast: Probabilistic 3D Trajectory Prediction for Non-Cooperative Aerial Obstacles via Transformer-MDN Architecture
Autonomous aerial vehicles operating in shared airspace must predict the future positions of non-cooperative obstacles to plan evasive maneuvers before a collision becomes unavoidable. Unlike cooperative systems that share intent, non-cooperative obstacles such as birds, uncontrolled drones, or debris exhibit multi-modal motion that deterministic predictors cannot adequately represent. Existing methods either rely on recurrent encoders that propagate temporal information sequentially, limiting their ability to capture long-range kinematic precursors of maneuver initiation, or produce point forecasts that provide no distributional information to downstream planners. This paper presents AeroCast, a probabilistic trajectory prediction framework that combines a Transformer encoder with a Mixture Density Network output head to predict per-timestep Gaussian mixture distributions over future three-dimensional displacements. A translation-invariant consecutive displacement encoding and a calibration-oriented training objective address the input design and mode-degeneracy challenges specific to mixture-based aerial trajectory prediction. On a hybrid real-and-synthetic quadrotor corpus spanning nine motion categories, AeroCast reduces Average Displacement Error and Final Displacement Error by approximately 50% relative to the baselines over a five-second horizon, and achieves the lowest negative log-likelihood and Continuous Ranked Probability Score among all compared methods. Ablation analysis identifies velocity input and model capacity as the primary contributors to prediction quality, and positional encoding as essential for long-horizon trajectory coherence. AeroCast inference completes in 0.1ms per sample, compatible with real-time onboard deployment at 100Hz.
SurveilNav: Collaborative Object Goal Navigation with Robot and Surveillance System ICRA 2026
With the growing deployment of surveillance systems in factories, offices, and homes, integrating them with robots offers a promising direction for collaborative and efficient task execution. However, existing approaches largely focus on single-robot scenarios and struggle with multi-view collaboration in large-scale environments. In this paper, we present a novel indoor collaborative object navigation dataset built on Habitat-Sim, featuring 206 cameras across 74 floors. The dataset enables systematic evaluation of an agent's ability to exploit multi-view surveillance information. To address the limitations of single-robot perception, we propose SurveilNav, a collaborative navigation framework that integrates active camera scheduling, joint 2D/3D mapping, VLM-based value estimation, and collaborative target verification. By synergizing the robot's dynamic local perception with the static global view of surveillance, this architecture effectively overcomes both the limited perception range of single agents and the inherent blind spots of fixed cameras, resolving inefficient exploration. Experimental results on the HM3D dataset demonstrate that SurveilNav substantially outperforms existing methods, achieving state-of-the-art performance in both exploration efficiency and navigation success rate. Moreover, the system shows strong potential for applications in large-scale search, home environments, and rescue missions.
comment: Accepted by ICRA 2026
ADM-Fusion: Adaptive Deep Multi-Sensor Fusion for Robust Ego-Motion Estimation in Diverse Conditions
Robust multi-sensor fusion is essential for reliable autonomy in diverse and degraded environments, where sensor reliability can fluctuate rapidly. Because different modalities fail in distinct ways, effective fusion should adaptively balance complementary cues rather than rely on fixed weighting. This adaptability is particularly important for ego-motion estimation, since accurate updates depend on the consistent integration of complementary sensor information. We propose ADM-Fusion, an end-to-end deep learning based multi-sensor fusion method designed to adapt to environmental changes and sensor degradation. ADM-Fusion employs an adaptive sensor mixture-of-experts framework with content-aware routing to dynamically assign weights to sensor inputs in real time. The system further incorporates separate translation and rotation branches, coupled through a cross-task attention mechanism to preserve task-specific specialization while enabling information sharing. ADM-Fusion is trained on the CARLA-LOC simulated dataset and subsequently fine-tuned on KITTI real-world data, demonstrating effective simulation-to-real transfer. Experiments show that ADM-Fusion remains robust under degraded conditions while maintaining competitive performance against existing methods.
comment: 8 pages, 4 figures
Invariant Kalman filtering for extended pose estimation in multi-IMU articulated rigid-body systems
Accurate extended pose estimation (orientation, velocity, and position) for IMU-instrumented articulated rigid-body systems is a key challenge in robotics and human motion analysis. The invariant extended Kalman filter (IEKF) addresses this problem for a single rigid body with convergence guarantees and consistency under unobservability, but extending these properties to articulated systems is nontrivial: inter-body pose coupling prevents a direct application, and incorporating joint kinematic constraints within the invariant framework remains an open problem. To address this gap, we introduce the relative L-extended pose, a Lie group representation for kinematic-tree systems. With one IMU per body, it yields group-affine dynamics and allows joint constraints to be expressed in invariant form. We incorporate these constraints as noise-free pseudo-measurements within an iterated IEKF (IterIEKF), thereby preserving the convergence and consistency guarantees of invariant filtering. Validated on both a UR5e robot and a human leg, the proposed IterIEKF outperforms all EKF, IterEKF, and absolute-pose IterIEKF baselines. It converges faster, exhibits lower run-to-run variability, and consistently achieves the lowest RMSE, with reductions of at least 50% compared to the second-best filter across all scenarios considered in this work.
BFMTrack: Latent Sequence Optimization for Physics-Based Motion Tracking with Behavioral Foundation Models
Behavioral Foundation Models (BFMs) offer a promising path toward universal physics-based character control by organizing a rich repertoire of physically plausible behaviors into a latent space, guided by a large-scale motion dataset. While these models excel at time-invariant tasks, such as goal-reaching and state-based reward optimization, their latent space does not directly support time-varying objectives, such as tracking a motion sequence. For tracking, existing heuristics rely on moving-window-averaging that fails to capture the nuances of highly dynamic motions. In this work, we propose a novel Latent Sequence Optimization (LSO) to address these shortcomings. Our approach combines simulation rollouts with a policy gradient update to optimize over a sequence of latents, extending the capabilities of BFMs toward precise motion tracking without requiring reward engineering and tuning. To guide the optimization toward smooth, coherent latent trajectories, we model the latent sequence using temporally correlated noise. We validate our approach across dense tracking, sparse keyframing, and direct deployment onto a real humanoid robot.
MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models
Vision-Language-Action (VLA) models are emerging robotic control systems that integrate perception, language understanding, and action generation in a unified architecture. Existing testing approaches for VLA-enabled robots rely on manually constructed symbolic test oracles that determine task success from final environment states. These oracles are costly to construct, require domain expertise, and are often tightly coupled to specific tasks and environments, limiting scalability and reuse. Furthermore, they provide only end-state assessments of task outcomes, offering limited insight into intermediate behavior and fault localization. To address these limitations, we introduce MANGO, a multi-agent framework that automatically generates fine-grained oracles from natural-language descriptions of robotic tasks. MANGO first generates a reusable library of atomic tasks, then generates simulator-grounded oracle definitions for each atomic task, and finally produces executable fine-grained oracles by decomposing complex instructions into ordered sequences of atomic actions and corresponding oracles. The framework uses collaborative Generator, Assessor, and Judge agents that iteratively refine generated artifacts through structured feedback. We evaluate MANGO on the LIBERO_10 and RoboCasa Humanoid Tabletop benchmarks. Results show that MANGO generates executable, fine-grained oracles that detect a similar number of failures as symbolic oracles while accurately localizing them and providing richer diagnostic information. Through ablation studies, we further analyzed component contributions and the effect of initial task set, while preserving oracle quality. Overall, the results show the feasibility and effectiveness of test oracle generation for VLA-enabled robots testing.
Swarm-Inspired Generation of Collective Behaviors in Graph Dynamical Systems
Collective behavior arises when locally interacting units produce coordinated global organization, from synchronization in dynamical systems to task-relevant information flow on graphs. The central challenge is not only to explain how collective behavior emerges, but to design local interaction rules that can produce desired global organization and generalize across graphs, dynamics and tasks.To address this challenge, we introduce the Swarm-Inspired Emergent Synchronizer (SIES), a graph-dynamical framework that learns generalizable local-interaction laws for controllable collective organization. Each node is an agent-like dynamical unit with a state and task cue, and signed source-target-conditioned attention acts as an adaptive coupling term inside an explicit evolution model. Therefore, SIES combines an explicit dynamical engine with local agent intelligence, similar to biological swarms. For synchronization control, SIES learns a generalizable coupling operator that produces prescribed synchronization patterns for CDSs across untrained network scales, target phase relations, and intrinsic node dynamics without retraining. The learned operator also reaches gait-related modes faster than three oscillator baselines and generalizes synchronization-driven locomotion to simulated multi-legged robots of different scales and a physical hexapod after leg disablement. For graph representation learning, SIES applies the same signed interaction principle to message passing and achieves the highest performance among the compared methods on heterophilous node-classification benchmarks. Together, these results position SIES as a generalizable and learnable graph-dynamical interaction framework with promise for synchronization control, adaptive robot coordination, and heterophilous graph representation learning.
Conformal Orbit-Valid Trust Horizons for Equivariant World Models
Learned world models are useful only over horizons on which their rollout error remains controlled. We study trust-horizon certification for latent world models with known group symmetries. Given a one-step latent residual and a finite-time expansion estimate, we form a raw horizon curve and calibrate it with a split-conformal multiplicative factor. On the reproducible audit set, the conformal factor is $γ_α=1.0$: the raw certificate is already conservative under the audit protocol. Across 50 stable audits, we observe zero anti-conservative violations, corresponding to an exact-binomial 95% upper bound of 5.8% on the violation rate. Our main structural result is that exact equivariance transports a calibrated trust-horizon curve over the group orbit: when the environment dynamics, encoder, predictor, action transform, and latent metric satisfy the stated equivariance/invariance conditions, rollout errors and trust horizons are orbit-constant. Empirically, the implemented models exhibit small orbit-transport residuals, with median 1.1% and maximum 4.1% over 14 orbit audits. The certificate is also non-vacuous (median certified-to-measured horizon ratio 0.67). A certificate-level calibration-cost study shows two complementary regimes. On a symmetric 2D substrate, equivariant, plain, and augmented models are all orbit-valid from a single calibration sector -- no separation, because the substrate already makes non-equivariant baselines approximately orbit-robust. A 3D yaw audit shows the other regime: the equivariant model obtains a one-sector safe and non-vacuous orbit-valid certificate, while healthy non-equivariant baselines pay violation, slack, sharpness, or additional-sector cost. The certificate is a conservative, distributional audit rather than a global reachability guarantee, and certificate-guided subgoal spacing is not confirmed in the current 3D CEM-MPC behavior layer.
comment: 15 pages, 6 figures
When Do Conservation Laws Survive Learned Representations? Certified Horizons for Latent World Models
We ask a representation-learning question about physical world models: when does a conservation law remain certifiable after a model learns a latent representation? A certified horizon bounds -- in advance, from measurable model defects -- how many steps a rollout provably stays on a physical invariant's level set. The key design choice is what is certified: not a learned latent Hamiltonian or a learned scalar witness (a model can conserve either while drifting in true energy), but the decoded physical invariant obtained by decoding the latent state and evaluating the known invariant. Around this object we derive shell-horizon certificates whose budget decomposes into representation, readout, and latent-dynamics defects, with a monotone alignment bridge through which a soft learned witness yields a certified horizon for the decoded invariant, and test them across state, learned-lift, and pixel observations on conservative systems. Conservation certificates can survive learned representation, but not all geometric priors survive equally: hard canonical symplectic structure yields the longest horizons in known phase coordinates yet does not cross a learned chart, whereas a controlled-Lipschitz-aligned soft invariant survives in the learned-representation settings we test; pixel certification is recovered on a readout-stable sub-tube; and the Kepler problem exposes a geometric boundary. The central object is therefore not a latent Hamiltonian, but a decoded physical invariant whose robustness to representation learning can be measured, certified, and falsified.
comment: 15 pages, including appendices. Code: https://github.com/TimothyWang418/se3-ejepa
Reward-Centered ReST-MCTS: A Robust Decision-Making Framework for Robotic Manipulation in High Uncertainty Environments
Monte Carlo tree search is attractive for robotic manipulation because it can improve action selection through simulation without requiring a fully differentiable policy. In uncertain domains, however, sparse terminal rewards and noisy transitions can make shallow search brittle: many candidate branches remain indistinguishable until late rollouts, and small simulation budgets amplify this ambiguity. This paper presents Reward-Centered ReST-MCTS, a decision-making framework that decomposes intermediate feedback into rule, heuristic, optional neural, and value-estimation channels, centers the resulting process signal against matched task contexts, and uses it to bias or repair search while preserving terminal-task evaluation. The primary evidence is intentionally tiered. Local tasks and matched ManiSkill diagnostics isolate reward-center mechanisms and ablations; matched option-level ManiSkill sweeps test robustness under primitive failure, observation noise, and initial-pose shifts while not claiming standard benchmark superiority; and an official same-backbone OpenVLA-OFT/LIBERO bridge tests bounded VLA action repair. The OpenVLA-OFT clean reproduction reaches 10/10 LIBERO-Spatial successes both with and without RCRM-Guard. A single-suite same-backbone action-channel stress artifact over ten paired LIBERO-Spatial action-channel stress episodes records 0/10 unguarded successes and 9/10 guarded successes. Additional observation-noise, language-perturbation, and visual-distractor probes are reported as coverage and negative-result context rather than superiority evidence. The resulting claim is bounded: Reward-Centered ReST-MCTS is an inspectable test-time verifier for same-backbone high-uncertainty manipulation, not a replacement VLA policy or a broad standard-benchmark superiority claim.
CRAFT: A Tendon-Driven Hand with Hybrid Hard-Soft Compliance
We introduce CRAFT hand, a tendon-driven anthropomorphic hand with hybrid hard-soft compliance for contact-rich manipulation. The design is based on a simple idea: contact is not uniform across the hand. Impacts concentrate at joints, while links carry most of the load. CRAFT places soft material at joints and keeps links rigid, and uses rollingcontact joint surfaces to keep flexion on repeatable motion paths. Fifteen motors mounted on the fingers drive the hand through tendons, keeping the form factor compact and the fingers light. In structural tests, CRAFT improves strength and endurance while maintaining comparable repeatability. In teleoperation, CRAFT improves handling of fragile and low-friction items, and the hand covers 33/33 grasps in the Feix taxonomy. The full design costs under $600 and will be released open-source with visionbased teleoperation and simulation integration. Project page: http://craft-hand.github.io/
comment: In RSS 2026. Website: https://roboticsconference.org/program/papers/192/
Cosmos 3: Omnimodal World Models for Physical AI
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.
Reference-Augmented Learning for Precise Tracking Policy of Tendon-Driven Continuum Robots
Tendon-Driven Continuum Robots (TDCRs) pose significant control challenges due to their highly nonlinear, path-dependent dynamics and non-Markovian characteristics. Traditional Jacobian-based controllers often struggle with hysteresis-induced oscillations, while conventional learning-based approaches suffer from poor generalization to out-of-distribution trajectories. This paper proposes a reference-augmented offline learning framework for precise 6-DOF tracking control of TDCRs. By leveraging a differentiable RNN-based dynamics surrogate as a gradient bridge, we optimize a control policy through an augmented reference distribution. This multi-scale augmentation scheme incorporates stochastic bias, harmonic perturbations, and random walks, forcing the policy to internalize diverse tracking error recovery mechanisms without additional hardware interaction. Experimental results on a three-section TDCR platform demonstrate that the proposed policy achieves a 50.9\% reduction in average position error compared to non-augmented baselines and significantly outperforms Jacobian-based methods in both precision and stability across various speeds.
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation
World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
From Local Corrections to Generalized Skills: Improving Neuro-Symbolic Policies with MEMO
Recent works use a neuro-symbolic framework for general manipulation policies. The advantage of this framework is that -- by applying off-the-shelf vision and language models -- the robot can break complex tasks down into semantic subtasks. However, the fundamental bottleneck is that the robot needs skills to ground these subtasks into embodied motions. Skills can take many forms (e.g., trajectory snippets, motion primitives, coded functions), but regardless of their form skills act as a constraint. The high-level policy can only ground its language reasoning through the available skills; if the robot cannot generate the right skill for the current task, its policy will fail. We propose to address this limitation -- and dynamically expand the robot's skills -- by leveraging user feedback. When a robot fails, humans can intuitively explain what went wrong (e.g., ``no, go higher''). While a simple approach is to recall this exact text the next time the robot faces a similar situation, we hypothesize that by collecting, clustering, and re-phrasing natural language corrections across multiple users and tasks, we can synthesize more general text guidance and coded skill templates. Applying this hypothesis we develop Memory Enhanced Manipulation (MEMO). MEMO builds and maintains a retrieval-augmented skillbook gathered from human feedback and task successes. At run time, MEMO retrieves relevant text and code from this skillbook, enabling the robot's policy to generate new skills while reasoning over multi-task human feedback. Our experiments demonstrate that using MEMO to aggregate local feedback into general skill templates enables generalization to novel tasks where existing baselines fall short. See supplemental material here: https://collab.me.vt.edu/memo
Face versus Body Tracking for Human-Robot Interaction: An Egocentric Dataset
Meaningful human-robot interaction (HRI) requires a robot to continuously assess user engagement through persistent user tracking. However, state-of-the-art Multi-Object Tracking models are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans moving in unpredictable nonlinear patterns, obstructing each other, or leaving and reentering the scene. These dynamics trigger frequent identity switches (IDSW), causing the robot to lose its footing mid-conversation. To address this, we introduce a focused, custom-annotated egocentric dataset collected via the Furhat robot. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended memory and appearance re-identification (ReID). Results indicate that increasing temporal memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49% compared to a standard tracking-by-detection baseline, effectively mitigating interaction breakdowns. As standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.
comment: 8 pages, 5 figures, 3 tables. Camera-ready version. Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)
Policy Gradient with Self-Attention for Model-Free Distributed Nonlinear Multi-Agent Games IROS 2026
Multi-agent games in dynamic nonlinear settings are challenging due to the time-varying interactions among the agents and the non-stationarity of the (potential) Nash equilibria. In this paper we consider model-free games, where agent transitions and costs are observed without knowledge of the transition and cost functions that generate them. We propose a novel distributed policy structure that follows the communication constraints in multi-team games, with multiple agents per team, and learned through policy gradients. Our formulation is inspired by the structure of distributed policies in linear quadratic games, which take the form of time-varying linear feedback gains. In the nonlinear case, we model the policies as nonlinear feedback gains, parameterized by self-attention layers to account for the time-varying multi-agent communication topology. We demonstrate that our approach achieves strong performance in several settings, including distributed linear and nonlinear regulation, and simulated and real multi-robot pursuit-and-evasion games.
comment: The paper has been accepted and will be presented at IEEE/RSJ IROS 2026
H-RINS: Hierarchical Tightly-coupled Radar-Inertial State Estimation via Smoothing and Mapping
Millimeter-wave radar enables robust perception in visually degraded environments, yet radar-inertial estimation remains prone to drift: sparse body-frame velocity measurements weakly constrain absolute orientation, leaving IMU biases poorly observable over the short horizons of sliding-window estimators. We propose a tightly coupled, hierarchical radar-inertial factor graph that decouples estimation into a high-rate resetting graph and a persistent global graph. The resetting graph fuses IMU preintegration, radar velocities, and adaptive ZUPT to produce smooth, low-latency odometry for real-time control. The persistent graph maintains a full state (poses, velocities, and biases) via keyframe-based geometric mapping and loop closures. Fully observable biases and their exact covariances are continuously injected from the persistent graph as priors into the resetting graph, anchoring the high-rate estimator against integration drift. Extensive evaluations demonstrate high accuracy and drift-reduced estimation at faster than real-time speeds. Code and datasets will be released upon paper acceptance.
comment: 8 pages, 8 figures
TactileReflex: Noise-Statistics-Driven Vision-Tactile Reflex Control for Force-Sensitive Manipulation
Manipulating fragile deformable containers, such as disposable plastic cups filled with liquid, demands real-time grip-force adaptation within an extremely narrow force margin: insufficient force causes slip, while excessive force irreversibly deforms the thin wall. Existing approaches struggle to achieve such force-sensitive manipulation tasks. We propose a noise-statistics-based calibration-driven reflex control paradigm with vision-based tactile sensing: by analyzing the sensor's intrinsic noise characteristics (via a brief static-hold-and-unload protocol), we directly derive all controller thresholds, eliminating external force calibration, trial-and-error manual tuning, or material-specific physical models. Instantiating this paradigm, we present TactileReflex, a three-channel closed-loop controller that extracts three image-level proxies, shear intensity ($S_y$), contact intensity ($F_n$), and center of pressure ($C$), from dual visuo-tactile sensors and drives prioritized reflex channels at ~12 Hz for slip suppression, weight-adaptive release, and force protection. Each channel closes the loop directly on its proxy via noise-derived thresholds. Ablation demonstrates that only the full three-channel system is able to prevent irreversible container deformation (5/5 success vs. at most 1/5 for partial configurations). In a dynamic pouring task, fixed-effort baselines fail in all 10 attempts due to pose drift, while TactileReflex achieves 9/10 success across two water volumes. As a self-contained and interpretable controller, TactileReflex can serve as a plug-and-play safety layer beneath high-level manipulation pipelines, including haptic-free VR teleoperation and vision-language-action (VLA) policies.
comment: 8 pages, 4 figures, 6 tables
HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training ECCV 2026
Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.
comment: Accepted to ECCV 2026. Maciej and Jesper contributed equally
Dynamics, stability, and energy efficiency of an energy-recycling rimless wheel with spring-clutch legs
This paper proposes an energy-recycling rimless wheel with spring-clutch legs. The proposed mechanism uses a lockable clutch to store part of the impact-induced elastic energy after foot contact and reinject it in the next gait cycle. First, we develop a hybrid dynamic model of the energy-recycling rimless wheel. Second, numerical simulations are used to examine the dynamics, local stability of periodic gaits, and the Cost of Transport (CoT) of the proposed mechanism. The simulation results show that the proposed mechanism reduces the CoT by up to 16.13% compared with a benchmark viscoelastic-legged rimless wheel with telescopic spring-damper legs. Compared with the rigid rimless wheel, the viscoelastic-legged and energy-recycling models reduce the CoT by more than 50%. The energy-recycling model also maintains locally stable periodic gaits over the tested slope and stiffness ranges. Finally, prototype experiments on an inclined plane are conducted to examine the feasibility of the proposed mechanism. The experimental results show that the proposed rimless wheel achieves passive walking on a shallow 1° slope, corresponding to a CoT of approximately 0.02. These results suggest that the proposed spring-clutch mechanism can improve the simulated walking efficiency of the energy-recycling rimless wheel, while the prototype experiments support the feasibility of passive walking with the mechanism.
Learning-Based Dynamics Modeling and Robust Control for Tendon-Driven Continuum Robots
Tendon-Driven Continuum Robots (TDCRs) pose significant modeling and control challenges due to complex nonlinearities, such as frictional hysteresis and transmission compliance. This paper proposes a differentiable learning framework that integrates high-fidelity dynamics modeling with robust neural control. We develop a GRU-based dynamics model featuring bidirectional multi-channel connectivity and residual prediction to effectively suppress compounding errors during long-horizon auto-regressive prediction. By treating this model as a gradient bridge, an end-to-end neural control policy is optimized through backpropagation, allowing it to implicitly internalize compensation for intricate nonlinearities. Experimental validation on a physical three-section TDCR demonstrates that our framework achieves accurate tracking and superior robustness against unseen payloads, outperforming Jacobian-based methods by eliminating self-excited oscillations. For implementation details and source code, please refer to https://github.com/ZiqingZou/ContinuumControl.
MILE: A Mechanically Isomorphic Hand Exoskeleton and Visuotactile Robotic Hand for Data Collection in Dexterous Manipulation
Dexterous robotic hands are expected to perform complex, contact-rich object manipulation, but learning such skills remains challenging because high-dimensional hands require high-fidelity demonstrations. Imitation learning provides a practical route for acquiring dexterous manipulation skills from human demonstrations, yet collecting synchronized multimodal demonstrations with accurate hand actions and tactile observations remains a key bottleneck. We present MILE, a teleoperation-based data-collection system comprising the human-first MILE exoskeleton and the mechanically corresponding MILE-Tac robotic hand. The system integrates custom-designed and fabricated modular joint encoders and compact MILE fingertip visuotactile sensor modules. The exoskeleton is informed by human-hand anatomy and ergonomic constraints, while the robotic hand is co-designed to preserve the selected four-finger kinematic topology. This correspondence enables joint-space command transfer and reduces reliance on task-space IK-based retargeting. The system synchronously records task-specific visual observations, four fingertip visuotactile streams, robot-hand proprioception, and exoskeleton-derived action commands. We evaluate MILE through a four-task teleoperation benchmark against representative glove-based and vision-based interfaces, and through imitation-learning experiments that compare policies trained with and without fingertip tactile input. The project page is available at https://sites.google.com/view/mile-system.
comment: 18 pages including supplementary material. Main manuscript and supplementary material included in this version
Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data
Scaling dexterous manipulation requires generalization across objects, scenes, and tasks, yet existing data sources face a trade-off between scale and scene/embodiment alignment: teleoperation data is well aligned with robot deployment but expensive to collect; simulation is scalable but limited by the sim-to-real gap; and real egocentric videos scale effectively but remain misaligned with robot deployment. We propose Wh0, a framework that uses generative video world models as scalable and controllable sources of egocentric human-hand manipulation data to unlock the manipulation capabilities of pretrained dexterous VLA models. Conditioned on language, objects, and scenes, Wh0 uses a generative world model to produce WM-H, a 50k-episode dataset of egocentric human-object interaction videos. Wh0 then converts the generated videos into robot-trainable supervision through hand motion reconstruction and visual editing. Co-trained with a limited amount of real robot data, WM-H adapts pretrained VLA models to dexterous manipulation deployment. Across 18 real-world dexterous manipulation tasks, compared with a model post-trained only on robot data, Wh0 improves zero-shot success on unseen tasks from 8.3% to 38.9%. Ablation studies further show that scalable generation and scene/embodiment alignment are key drivers of performance gains. Videos and open-source code can be found on our project website: https://chenyt31.github.io/wh0.github.io/.
comment: Under review. The first three authors contributed equally to this work
Flow as Flow: Modeling Robot Velocity Fields as Probability Velocity Fields for Flow-Based Object Manipulation
Cross-embodiment data have become central to training robotic foundation models. To leverage such heterogeneous data, we focus on flow-based object manipulation, where robot flows (robot velocity fields) serve as embodiment-agnostic motion representations. Previous studies do not formulate robot flows as dense velocity fields, but as displacements of sparse keypoints, while such velocity fields better match the continuous-time nature of motions. We propose Flow as Flow, a framework that models robot flows as probability flows based on a flow matching formulation. By naturally modeling such velocity fields within this formulation, our method achieves efficient and high-quality robot flow generation. Across standard benchmarks, our method outperforms representative baseline methods on standard metrics, while achieving approximately 33$\times$ faster generation. Furthermore, through real-world experiments evaluating 9 methods with 260 trials per method across 13 manipulation tasks, we show that our method achieves a higher average success rate than the baseline methods. Our project page is available at https://flow-as-flow-u0n5y.kinsta.page.
A Pairwise Human-Human Interaction Detection and Recognition Framework for Mobile Service Robots
Autonomous mobile service robots, such as lawnmowers or cleaning robots, operating in human-populated environments need to reason about human-human interactions to support safe and socially aware navigation. For such systems, interaction understanding is not primarily a fine-grained recognition problem, but a perception problem under limited sensing quality and computational resources. Many existing approaches focus on holistic group activity recognition, often relying on complex and computationally expensive models that are not well suited for mobile robotic platforms. In this work, we argue that pairwise human interactions constitute a minimal yet sufficient perceptual unit for robot-centric social understanding. We study the problem of identifying interacting person pairs and classifying coarse-grained interaction behaviors sufficient for downstream group-level reasoning and robot decision-making. To this end, we adopt a two-stage framework in which candidate interacting pairs are first identified using lightweight geometric and motion cues, and interaction types are subsequently classified using a relation network. We evaluate the proposed approach on the JRDB dataset, where it achieves competitive performance with reduced computational cost and model size compared to appearance-based methods. Additional experiments on the Collective Activity Dataset (CAD) and zero-shot evaluation on a lawnmower-collected dataset further demonstrate the generalizability of the proposed framework. These results suggest that simple geometric and motion cues provide a practical and efficient basis for interaction-aware perception in mobile service robots. Code is released.
TEXEDO : Test Time Scaling for Controller-aware Language-conditioned Humanoid Motion Generation
Text-conditioned motion generation is a promising interface for programming humanoid robots, yet current generators are often trained on human motion datasets retargeted to robot morphologies. Although such data provides rich semantic and kinematic priors, it fails to capture the nuances of whole-body tracking controllers, including balance, contact dynamics, actuation limits, and controller-specific failure modes. As a result, generated motions can be semantically plausible but difficult or impossible for the robot to execute. We introduce TEXEDO, a test-time scaling framework for humanoid motion generation that improves motion quality without requiring a stronger underlying generator. Given a text prompt, TEXEDO samples multiple candidate motions from a pretrained text-conditioned generator and selects the best motion that is both executable and task-aligned. The reward model combines a dynamic feasibility verifier, distilled from whole-body tracking rollouts to predict physical executability, with a semantic alignment verifier that measures text-motion alignment in a learned co-embedding space. Our pipeline treats dynamic feasibility as a hard constraint and semantic alignment as the selection objective within the feasible set. Through large-scale simulation studies and real-world deployment on a Unitree G1 humanoid robot, we show that TEXEDO consistently improves both tracking fidelity and text alignment. These results demonstrate that grounded verification is an effective path toward deployable language-guided humanoid motion generation. Project website: https://jianuocao.github.io/TEXEDO/
PanoVine: Whole-Body Visuomotor Control for Soft Growing Vine Robot
Vine robots, a class of soft, growing robots, are suitable for navigating complex and confined environments due to their compliant bodies and self-supporting growth mechanism. However, hysteresis, tether interactions, and deformations make them difficult to predict and model, which in turn limits the effectiveness of conventional planning and control approaches. In this work, we present a data-driven, vision-based control framework for the first autonomous vine robot system. Our system integrates 19 cameras distributed along the robot's body to provide comprehensive feedback of both the robot state and the surrounding environment. Using this rich whole-body vision feedback, we train an end-to-end visuomotor policy from demonstrations for closed-loop autonomous control in complex environments. The policy efficiently aggregates information from distributed sensing while maintaining robustness to inaccurate robot states and actuation. Experimental results demonstrate that the learned policy enables robust navigation and manipulation in challenging scenarios, including steering through branched structures, climbing up slopes, traversing unsupported terrain, reaching objects precisely, and maneuvering through confined spaces and obstacles. Project website https://panovine-bot.github.io
MuTRAP: Multi-trigger Trojans Attacking Robot Task Planning Systems IROS
Robots need task planning methods to achieve goals that require more than one action. Recently, large pretrained models have demonstrated impressive performance in task planning. For instance, large language models (LLMs) can generate task plans using action and goal descriptions. Despite the rapid progress of large models in robot intelligence, their security implications remain only partially understood, leaving important gaps in the exploration of potential vulnerabilities in LLM-driven robotic planning systems. To investigate such risks, in this paper, we develop MuTRAP, the first multi-trigger trojan attack specifically designed and targeted for LLM-assisted robot task planners. MuTRAP follows the standard practice of LLM usage in robotics where the backbone LLM is typically frozen and hosted in a central server limiting attacker's reach. In contrast, MuTRAP injects backdoor using a small set of task-specific parameters. In addition, we develop a trigger optimization method for selecting multiple-trigger words that are most effective for different robot applications. For instance, one can use unique trigger word "herical" to activate a specific malicious behavior, e.g., cutting hand on a kitchen robot. Through MuTRAP that demonstrates the vulnerability of current LLM-based planners, our goal is to promote the development of secured robot intelligence. Details and demos are provided in: https://mutrap.github.io/MuTRAP/
comment: Accepted for publication at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
SkyJEPA: Learning Long-Horizon World Models for Zero-Shot Sim-to-Real Control of Quadrotors
Accurate dynamics models are critical for informed decision-making in robotic systems, particularly for agile aerial vehicles operating under uncertainty. Neural network dynamics models are attractive for capturing complex nonlinear effects, but existing predictive approaches struggle with long-horizon forecasting because their autoregressive rollout mechanism amplifies errors over time. Joint Embedding Predictive Architectures (JEPAs) offer a compelling alternative by modeling dynamics in latent space, yet prior JEPA-style methods for robot navigation have been studied primarily for kinematic-level planning, with limited investigation in high-frequency control. In this work, we introduce the JEPA-style model for real-time quadrotor control. The proposed approach combines a latent dynamics model with a novel physics-inspired prober that maps frozen latents to interpretable state, enabling physically grounded long-horizon prediction. Additionally, we combine the learned model with a sampling-based optimal control solution to take advantage of its predictive capabilities for real-time control on embedded hardware. Finally, to reduce the dependence on expensive and unsafe real-world data collection, we develop a structured pipeline for automated dataset generation. Extensive open-loop and outdoor closed-loop experiments demonstrate accurate prediction, robust zero-shot sim-to-real transfer, and strong generalization across diverse operating conditions.
comment: Under Review
Prior Reinforce: Goal-Conditioned Dynamic Manipulation with Limited Trials IROS 2026
Embodied robots have achieved strong performance in many real-world manipulation tasks, yet agile dynamic manipulation remains challenging due to high sensitivity to motion parameters and sparse outcome-level feedback. Tasks such as shooting a basketball into a hoop require precise control of fast open-loop motions, where small trajectory variations can lead to large outcome deviations, making data-efficient adaptation difficult for existing methods that rely on large-scale interaction, reward engineering, or accurate dynamic modeling. We propose Prior Reinforce (P.R.), a simple and practical framework for goal-conditioned dynamic manipulation. The method first learns a structured motion manifold from a small set of demonstrations using a conditional diffusion model, and then adapts motions toward new goals through feedback-driven optimization in a low-dimensional condition space. By separating motion generation from outcome-driven adaptation, the framework enables efficient refinement using only a small number of real-world trials under noisy perception. Experiments on multiple real-world dynamic manipulation tasks demonstrate that P.R. reliably achieves new goals within as few as ten total trials while remaining robust to perception noise and hardware uncertainty, suggesting a practical approach for low-trial real-world robot adaptation. Project website: https://adap-robotics.github.io/.
comment: Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
RTFF: Random-to-Target Fabric Flattening Policy using Dual-Arm Manipulator
Robotic fabric manipulation remains challenging due to fabric deformability and occlusions from wrinkles and the manipulator. This paper defines Random-to-Target Fabric Flattening (RTFF) as the task of bringing a randomly wrinkled fabric to an arbitrary user-specified wrinkle-free target pose. RTFF requires simultaneous flattening and pose alignment, where the two objectives are inherently coupled since flattening the fabric displaces its pose, while realigning it tends to introduce wrinkles. To solve this task, this paper anchors both the current and target fabric states to the same template mesh, enabling direct vertex-level wrinkle and pose assessment without registration. Building on this representation, a hybrid Imitation Learning--Visual Servoing (IL--VS) RTFF policy is proposed. A novel Mesh Action Chunking Transformer (MACT) leverages structured mesh observations to achieve goal-conditioned coarse alignment from a compact demonstration set, after which VS ensures precise convergence to the target. The policy is validated on a real dual-arm teleoperation system, demonstrating precise alignment to unseen target poses, fabric types, and scales. Code and videos: https://kaitang98.github.io/RTFF_Policy/
comment: 8 pages, 7 figures, conference
PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM
Monocular SLAM historically suffers from scale ambiguity and tracking failure in dynamic environments. While recent vision foundation models (VFMs) provide remarkable zero-shot depth priors, naively integrating these deterministic predictions ignores predictive uncertainty and frame-to-frame scale inconsistencies. We propose PRISM-SLAM, a real-time framework that rigorously integrates VFM priors into a structured Bayesian factor graph to achieve scale-aware, metric-consistent localization and mapping. Specifically, we introduce a Plücker Ray-Distance Factor to anchor monocular observations in absolute space within a globally consistent metric coordinate system, mathematically resolving scale drift by making the metric scale Fisher-identifiable. To handle environmental dynamics, we derive an epistemic uncertainty proxy from temporal depth consistency and formulate a Dynamic Scene Uncertainty Gating (DSUG) mechanism. This soft-gating approach probabilistically down-weights dynamic distractors without incurring the heavy computational overhead associated with traditional semantic segmentation masks. By employing a multi-process architecture that asynchronously processes VFM inference and geometric tracking, PRISM-SLAM provides verified metric output at 30 FPS using solely RGB input, bridging the gap between foundation models and real-world robotic applications. Evaluated on the TUM RGB-D and 7-Scenes benchmarks, PRISM-SLAM achieves a metric $SE(3)$ Absolute Trajectory Error (ATE) nearly identical to its oracle-aligned $Sim(3)$ error. This demonstrates that our system can produce deployment-ready metric trajectories by delivering robust metric SLAM solutions without any post-hoc scale correction. Project page: https://prismslam-cmd.github.io/prismslam_pr/
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation
Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
Debate2Create: Robot Co-design via Multi-Agent LLM Debate
We introduce Debate2Create (D2C), a multi-agent LLM framework that formulates robot co-design as structured, iterative debate grounded in physics-based evaluation. A design agent and control agent engage in a thesis-antithesis-synthesis loop, while criterion-specific LLM judges provide multi-objective feedback to steer exploration. Across five MuJoCo locomotion benchmarks, D2C achieves the highest default-normalized score among the evaluated LLM-based and black-box baselines, with gains up to 3.2x on Ant and nearly 9x on Swimmer. Iterative debate yields 18-35% gains over compute-matched zero-shot generation, and D2C-generated rewards transfer to default morphologies in 4/5 tasks. These results suggest that structured, simulator-grounded multi-agent interaction is a useful mechanism for joint morphology-reward optimization under a fixed-topology, per-candidate-RL protocol. Project page: debate2create.github.io.
RN-D: Discretized Categorical Actors for On-Policy Reinforcement Learning ICML 2026
On-policy Reinforcement Learning (RL) remains a dominant paradigm for continuous control, yet standard implementations rely on Gaussian actors and relatively shallow MLP policies, often leading to brittle optimization when gradients are noisy, and policy updates must be conservative. In this paper, we revisit actor policy representation as a first-class design choice for on-policy RL. We study discretized categorical actors, which represent each action dimension as a distribution over discrete bins and induce a policy objective analogous to classification cross-entropy loss. Building on architectural advances from supervised learning, we further pair discretized categorical actors with regularized networks, yielding RN-D. Across diverse continuous-control benchmarks, we show that simply replacing the standard Gaussian actor with our proposed actor substantially improves performance, achieving state-of-the-art results within on-policy RL. We release our code at https://github.com/alwaysbyx/RND-RL.
comment: Accepted by ICML 2026
Adaptive-Horizon Conflict-Based Search for Closed-Loop Multi-Agent Path Finding
Multi-Agent Path Finding (MAPF) is a core coordination problem for large robot fleets in automated warehouses and logistics. Existing approaches are typically either open-loop planners, which must compute complete trajectories before execution and therefore may incur substantial planning latency before actions can be taken, or closed-loop heuristics without reliable performance guarantees, limiting their use in safety-critical deployments. This paper presents Anytime Closed-Loop Conflict-Based Search (ACCBS), a closed-loop algorithm built on a finite-horizon variant of Conflict-Based Search (CBS) with a horizon-changing mechanism inspired by iterative horizon-deepening in Model Predictive Control (MPC). ACCBS dynamically adjusts the planning horizon based on the available computational budget, and reuses a single constraint tree to enable seamless transitions between horizons. As a result, it produces high-quality feasible solutions quickly while being asymptotically optimal as the budget increases, exhibiting anytime behavior. Extensive case studies demonstrate that ACCBS achieves a favorable balance between computational efficiency, solution quality, and execution flexibility, while naturally accommodating online disturbances through its closed-loop formulation.
Sampling Strategies for Robust Universal Quadrupedal Locomotion Policies
This work focuses on sampling strategies of configuration variations for generating robust universal locomotion policies for quadrupedal robots. We investigate the effects of sampling physical robot parameters and joint proportional-derivative gains to enable training a single reinforcement learning policy that generalizes to multiple parameter configurations. Three fundamental joint gain sampling strategies are compared: parameter sampling with (1) linear and polynomial function mappings of mass-to-gains, (2) performance-based adaptive filtering, and (3) uniform random sampling. We improve the robustness of the policy by biasing the configurations using nominal priors and reference models. All training was conducted using the RaiSim simulation environment, tested in simulation on a range of diverse quadrupeds, and zero-shot deployed onto hardware using the ANYmal quadruped robot. Compared to multiple baseline implementations, our results demonstrate the need for significant joint controller gains randomization for robust closing of the sim-to-real gap.
Multiagent Systems
Generating Realistic Individual Activity Schedules via Activity Location Allocation Based on Simulated Travel Times
Individual level daily activity schedules are essential for a wide range of applications, including infectious disease control, urban transportation planning, and policy design. In practice, such schedules are typically generated by combining population data with travel survey data. These data sources are used because they are often publicly available, whereas observed individual activity schedules are difficult to obtain due to privacy concerns. However, because of the complexity of mobility modelling, it is difficult to generate realistic activity schedules that also preserve travel times consistent with those reported in travel surveys. To address this issue, we propose a framework for generating activity schedules that iteratively applies a dynamic programming method to allocate activity locations based on simulated travel times. Numerical experiments with dummy data show that the proposed method reduces the discrepancy between simulated travel times and those reported in travel surveys by 52.2% relative to the first iteration through iterative refinement.
comment: 8 pages, 5 figures. This is the author version of a short paper accepted for presentation in the poster session at the 17th Conference on Spatial Information Theory (COSIT 2026)
Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War
We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secret), and a reliability dimension where every turn must follow a strict JSON schema and an illegal action is silently discarded. The engine is private and each match uses a fresh random map seed and opponent, mitigating the data contamination that affects public benchmarks. Models receive a (near) rule-only prompt with no build-order advice (two tactical seed phrases were present during data collection; see Section 2.7). We benchmark 15 reasoning models across 54 matches and 5,258 actions. Findings: (1) the nuclear rush dominates (78% on the rules-coherent v0.11+ sub-corpus; 85% corpus-wide) with a sole-launcher signature that is largely mechanical under secret-simultaneous launch rules, not a cognitive deterrence failure; (2) military conquest is rare but faster (12.3 vs 18.9 turns); (3) diplomacy is prolific yet almost never consummated; (4) ~58% of illegal actions are fog/state errors, making the illegal-action rate a measure of belief-tracking; (5) -- the least established, and the only one we label exploratory -- a weak link associates reliability with winning. The corpus is small, unbalanced and not side-swapped, so the ranking is a preliminary descriptive view, not a contribution. Beyond ranking, the turn-by-turn traces of actions and messages make the corpus a lens on how LLMs reason under adversarial uncertainty -- their belief-tracking, spontaneous deception, and per-model cognitive "personas" -- which we frame as a future research direction. We release the replay format, an isometric viewer and all replays; engine source on request.
comment: 25 pages including appendices, 8 figures, 4 tables; appendices include verbatim system prompt and engine resolution pseudocode. All correlations reported with p-values, 95% bootstrap confidence intervals and Spearman's rho; includes a Steiger test and Bradley-Terry fit
Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy
Large language models are making research production scalable, shifting the bottleneck from producing artifacts to judging claims. We present \textsc{Agon}, a research orchestrator that validates what can be checked inside the workflow and leaves the remaining judgments to human scientists. \textsc{Agon} is built on six design principles: Prompt Economy, Future-Facing, Minimal Prompts, OmniDisciplinary, Massive Parallelism, and Zero-Code. We ran \textsc{Agon} across domains for 444 iterations of Prompt Economy loops, using only small starting topics and no human-written experimental code. These deployments demonstrate scalability while exposing new classes of failure. We organize these failures into a taxonomy along severity, fixability, visibility, and capability locus. The taxonomy separates failures the loops can see and fix from those that require human judgment. Together, these results show that \textsc{Agon} is pushing research toward a new paradigm: machine scales, human steers.
Phoneme-Level Mispronunciation Screening in Polish-Speaking Children with an Explainable Assistant INTERSPEECH 2026
Early identification of speech sound errors in children is often limited by access to specialists, motivating lightweight screening tools that can operate outside the clinic. We present a screening pipeline for Polish-speaking children focused on sibilant substitutions, coupling a wav2vec2-based CTC token recognizer with alignment-based error typing and a template-grounded caregiver assistant for screening, not diagnosis. On a held-out test set of 10 unseen children comprising 559 utterances, the recognizer achieves 88.7 percent exact sequence match. As a conservative screening proxy, we flag a mismatch when the system emits substitution-evidence bracketed tokens at the target segment, yielding 72.9 percent precision, 61.4 percent recall, F1 = 0.67, and a 2.7 percent false-alarm rate on target-correct items. We describe the assistant's safety boundaries and outline a clinician-in-the-loop validation plan for future deployment.
comment: Accepted to INTERSPEECH 2026. 5 pages, 1 figure, 4 tables
GCT-MARL: Graph-Based Contrastive Transfer for Sample-Efficient Cooperative Multi-Agent Reinforcement Learning
In cooperative multi-agent reinforcement learning (MARL), from a deployment perspective, it is challenging and expensive to train agents from scratch for each new environment or task. In this work, we propose GCT-MARL, a transfer learning framework that builds on the multi-view graph contrastive backbone of MAIL and augments it with a per-view, adaptively weighted alignment loss and a two-phase training protocol specifically designed for transfer across populations of varying sizes and compositions. We empirically demonstrate that the proposed framework markedly accelerates convergence on the target task relative to from-scratch training, in both homogeneous (within-faction, varying N) and heterogeneous (cross-faction and mixed unit-type) transfer scenarios. Furthermore, we show that the framework naturally supports continual learning by sequentially chaining the two-phase transfer protocol across a series of related tasks. Overall, this work provides a unified approach to mitigating key limitations in current MARL transfer methods with new insights at both methodological and empirical levels.
comment: Accepted at The Continual RL Workshop, RLC 2026
Kiko: Programming Agents to Enact Interaction Protocols
Realizing a multiagent system involves implementing member agents who interact based on a protocol while making decisions in a decentralized manner. Current programming models for agents offer poor abstractions for decision making and fail to adequately bridge an agent's internal decision logic with its public decisions. We present Kiko, a protocol-based programming model for agents. To implement an agent, a programmer writes one or more decision makers, each of which chooses from among a set of valid decisions and makes mutually compatible decisions on what messages to send. By completely abstracting away the underlying communication service and by supporting practical decision-making patterns, Kiko enables agent developers to focus on business logic. We provide an operational semantics for Kiko and establish that Kiko agents are protocol compliant and able to realize any protocol enactment.
Policy Gradient with Self-Attention for Model-Free Distributed Nonlinear Multi-Agent Games IROS 2026
Multi-agent games in dynamic nonlinear settings are challenging due to the time-varying interactions among the agents and the non-stationarity of the (potential) Nash equilibria. In this paper we consider model-free games, where agent transitions and costs are observed without knowledge of the transition and cost functions that generate them. We propose a novel distributed policy structure that follows the communication constraints in multi-team games, with multiple agents per team, and learned through policy gradients. Our formulation is inspired by the structure of distributed policies in linear quadratic games, which take the form of time-varying linear feedback gains. In the nonlinear case, we model the policies as nonlinear feedback gains, parameterized by self-attention layers to account for the time-varying multi-agent communication topology. We demonstrate that our approach achieves strong performance in several settings, including distributed linear and nonlinear regulation, and simulated and real multi-robot pursuit-and-evasion games.
comment: The paper has been accepted and will be presented at IEEE/RSJ IROS 2026
An Explanation-oriented Inquiry Dialogue Game for Expert Collaborative Recommendations
This work presents a requirement analysis for collaborative dialogues among medical experts and an inquiry dialogue game based on this analysis for incorporating explainability into multiagent system design. The game allows experts with different knowledge bases to collaboratively make recommendations while generating rich traces of the reasoning process through combining explanation-based illocutionary forces in an inquiry dialogue. The dialogue game was implemented as a prototype web-application and evaluated against the specification through a formative user study. The user study confirms that the dialogue game meets the needs for collaboration among medical experts. It also provides insights on the real-life value of dialogue-based communication tools for the medical community.
When AI Meets Finance (StockAgent): Large Language Model-based Stock Trading in Simulated Real-world Environments
Can AI Agents simulate real-world trading environments to investigate the impact of external factors on stock trading activities (e.g., macroeconomics, policy changes, company fundamentals, and global events)? These factors, which frequently influence trading behaviors, are critical elements in the quest for maximizing investors' profits. Our work attempts to solve this problem through large language model based agents. We have developed a multi-agent AI system called StockAgent, driven by LLMs, designed to simulate investors' trading behaviors in response to the real stock market. The StockAgent allows users to evaluate the impact of different external factors on investor trading and to analyze trading behavior and profitability effects. Additionally, StockAgent avoids the test set leakage issue present in existing trading simulation systems based on AI Agents. Specifically, it prevents the model from leveraging prior knowledge it may have acquired related to the test data. We evaluate different LLMs under the framework of StockAgent in a stock trading environment that closely resembles real-world conditions. The experimental results demonstrate the impact of key external factors on stock market trading, including trading behavior and stock price fluctuation rules. This research explores the study of agents' free trading gaps in the context of no prior knowledge related to market data. The patterns identified through StockAgent simulations provide valuable insights for LLM-based investment advice and stock recommendation. The code is available at https://github.com/MingyuJ666/Stockagent.
comment: 33 pages, 10 figures. Published in ACM Transactions on Intelligent Systems and Technology (TIST)
Computing Evolutionarily Stable Strategies in Imperfect-Information Games
We present an algorithm for computing evolutionarily stable strategies (ESSs) in symmetric perfect-recall extensive-form games of imperfect information. Our main algorithm is for two-player games, and we describe how it can be extended to multiplayer games. The algorithm is sound and computes all ESSs in nondegenerate games and a subset of them in degenerate games which contain an infinite continuum of symmetric Nash equilibria. The algorithm is anytime and can be stopped early to find one or more ESSs. We experiment on an imperfect-information cancer signaling game as well as random games to demonstrate scalability.
Debate2Create: Robot Co-design via Multi-Agent LLM Debate
We introduce Debate2Create (D2C), a multi-agent LLM framework that formulates robot co-design as structured, iterative debate grounded in physics-based evaluation. A design agent and control agent engage in a thesis-antithesis-synthesis loop, while criterion-specific LLM judges provide multi-objective feedback to steer exploration. Across five MuJoCo locomotion benchmarks, D2C achieves the highest default-normalized score among the evaluated LLM-based and black-box baselines, with gains up to 3.2x on Ant and nearly 9x on Swimmer. Iterative debate yields 18-35% gains over compute-matched zero-shot generation, and D2C-generated rewards transfer to default morphologies in 4/5 tasks. These results suggest that structured, simulator-grounded multi-agent interaction is a useful mechanism for joint morphology-reward optimization under a fixed-topology, per-candidate-RL protocol. Project page: debate2create.github.io.
Systems and Control (EESS)
Multi-Worker Assembly Line Rebalancing with Relevance-Guided Configuration Preservation
In assembly line balancing, tasks are assigned to stations in order to satisfy a required cycle time. When production conditions change, the line must be rebalanced by modifying the current task allocation, typically aiming to move as few tasks as possible between stations. Similarity measures are commonly used to control such changes, but they generally evaluate configuration preservation by treating all tasks equally, which may not reflect their different practical importance. In this work, a \emph{pruned Mean Similarity Factor} is proposed for assembly line rebalancing, evaluating similarity only over a subset of structurally relevant tasks identified through a relevance score. The proposed measure is integrated into a compact mixed-integer linear programming (MILP) formulation that considers practical aspects of manual assembly, specifically workload balance, ergonomic exposure, multi-worker stations, and positional constraints. Computational experiments on extended benchmark instances derived from the literature show that the proposed approach can obtain optimal rebalancing solutions within reasonable computational times, while maintaining high task colocation and balanced workload and ergonomic distributions. In particular, focusing the similarity evaluation on relevant tasks helps reduce the computational effort.
Suboptimal and Reduced-Order MPC via Timescale Separation
In this paper, we propose a generalized framework for the design and analysis of suboptimal and reduced-order nonlinear Model Predictive Control (MPC) architectures. The proposed framework manages real-time operation of MPC schemes by (i) computing the control action suboptimally, i.e., by running a generic optimal control algorithm for a finite number of iterations, and (ii) relying on a reduced-order model that neglects part of the plant dynamics (accounting for, e.g., unmodeled dynamics or a low-level compensator). To rigorously handle the interplay between optimization error and model mismatch, we treat the sampling time as a tunable design parameter. We analyze the resulting closed-loop system, comprising the full-order physical plant interconnected with the iterative optimization algorithm (treated as a dynamical system), by leveraging tools from timescale separation. We prove that operating at a sufficiently fast sampling rate ensures that the closed-loop system maintains recursive feasibility and achieves an exponentially stable equilibrium point. The effectiveness of the proposed framework is validated on an underactuated two-link robotic arm through virtual experiments in the high-fidelity MuJoCo physics engine.
Human-Robot Shared Control for Humanized End-Effector Teleoperation
Recent advances in robotics have enabled robots to operate in shared human environments, emphasizing the importance of effective human robot interaction HRI. Prior studies indicate that anthropomorphism, defined as the incorporation of human like features into robotic systems, facilitates more natural interaction and enhances both task performance and user experience. In robotic arm teleoperation, however, user controlled motions often deviate from human like kinematic characteristics due to intrinsic limitations of teleoperation systems. In this work, we propose a real time framework that generates human like end effector trajectories based on the two thirds power law of voluntary human hand movements, while preserving the operators intended control inputs. The proposed approach is validated through real world experiments conducted on a 6 degree of freedom Dobot CR10 robotic arm. Quantitative analysis demonstrates that the generated trajectories exhibit significantly stronger adherence to human like kinematic profiles compared to conventional teleoperation, with the estimated beta coefficient moving 39.7% closer on average to the theoretical value of 1/3. Furthermore, the method achieves an approximate 34% improvement in motion smoothness, measured by RMS torque rate reduction, with 80% of evaluated motion patterns showing statistically significant improvements while maintaining comparable task completion times.
CONDUCTOR: An LLM-Orchestrated Digital Twin for Uncertainty-Aware Distribution Grid Operations
Large language models (LLMs) are proposed as natural-language interfaces to power system analysis, yet existing frameworks are validated almost exclusively on synthetic benchmarks and support only deterministic studies. We present CONDUCTOR, an LLM-orchestrated digital twin for distribution grid operations. An open-weights LLM orchestrates power system analysis and optimization solvers and, unlike prior systems, also performs uncertainty-aware studies: probabilistic security assessment, robust corrective dispatch, and flexibility-envelope and hosting-capacity characterization. We test it on the Bornholm 60 kV distribution network - a real Danish island power system - using one year of smart-meter measurements. An operator case study spans deterministic assessment, probabilistic risk quantification, and robust dispatch. Across a 68-prompt behavioral catalog scoring tool use, evidence consistency, state-mutation discipline, and refusal calibration, the orchestrator answers 98.5% of tasks correctly on the first attempt - the lone failure being a missing answer, not a wrong one. The full pipeline is released open source.
Multiplayer Reach-Avoid Differential Games with Defender-Side Information Delay
We consider a class of pursuit-evasion games in which multiple defenders and attackers move in the plane with bounded speeds, while each defender observes the states of other agents with a constant time delay. For the one-attacker-one-defender case, we derive an explicit analytical characterization of the attacker's delayed attack region and prove its convexity under mild assumptions. When the defender can guarantee capture, we formulate a convex optimization problem to compute the capture point and derive optimal strategies for both players. These strategies are shown to constitute a subgame-perfect Nash equilibrium by exploiting the sequential structure induced by the information delay. The analysis is further extended to the one-attacker-multiple-defender scenario and to the general multiplayer setting. In the latter case, delay-aware pairwise winning relations are incorporated into a maximum matching formulation to address the defender-attacker assignment. Numerical simulations for one-on-one, one-vs-multiple, and multi-agent cases validate the theoretical results and illustrate the impact of information delay on game outcomes and optimal strategies.
Trade-off invariance for weighted scalarizations in multi-objective optimization
We consider weighted-sum scalarizations for an abstract multi-objective minimization problem defined by the vector-valued map $U\ni u\mapsto ( f_1(u),\ldots, f_N(u))$, where $U$ is an arbitrary nonempty set and no topology, convexity, compactness, or lower semicontinuity assumption is imposed. Using the open simplex as parameter space for positive weights, we show that the Trade-off Invariance Principle for scalarizations yields a generic uniqueness property in the objective space. Namely, for almost every weight vector, all minimizers of the corresponding weighted-sum scalarization have the same objective vector. Moreover, excluding again a null-measure subset, all minimizing sequences determine the same limiting objective vector, independently of the chosen sequence. We also give a geometric interpretation of these results in the attainable objective set: for almost every positive weight vector, the scalarization exposes at most one nondominated point. Moreover, minimizing sequences determine at most one asymptotically exposed objective vector in the closure of the attainable set.
comment: 9 pages
An Integer Linear Programming Approach for Maximum Power Extraction from Solar PV Plants under Partial Shading
Partial shading in solar photovoltaic (SPV) plants, particularly in urban environments, is a common challenge caused by nearby trees, buildings, or other fixed obstructions, leading to a significant reduction in overall system efficiency. Dynamic and static PV array reconfiguration strategies are widely regarded as effective approaches for mitigating the adverse effects of partial shading. However, Dynamic Array Reconfiguration (DAR) is rarely adopted in practical systems due to high switching complexity and substantial computational requirements. In contrast, Static Array Reconfiguration (SAR) does not require complex switching arrangements or additional computational resources, making it more suitable for real-world implementation. However, SAR is a one-time configuration and cannot adapt to dynamically changing shading conditions. Existing SAR techniques rearrange PV modules based on assumed shading regions rather than the actual shading pattern, which limits their effectiveness under practical, time-varying conditions. In this work, an SAR technique is proposed that explicitly considers the actual shading pattern on the PV array. The proposed approach accounts for shading caused by nearby fixed obstructions that varies throughout the day as well as across different seasons. The performance of the proposed technique was evaluated by comparing it with existing methods considering a PV array with a square matrix, and a small-scale laboratory prototype of non-square matrix was developed to demonstrate its practical applicability in real-world scenarios. It has been observed that the method consistently delivers an optimal power output for both software simulation and practical experiment compared to other available techniques.
Data-Driven Robust MPC for Unknown Nonlinear Systems via Set-Membership Learning
Data-driven model predictive control (MPC) has become an attractive approach for controlling unknown systems, especially when data are corrupted by noise. However, most existing data-driven MPC methods focus on linear systems, and little attention has been given to nonlinear dynamics under disturbances. To fill this gap, we propose a robust data-driven min-max MPC scheme for unknown nonlinear systems with process disturbances. We represent the unknown nonlinear dynamics using vector fields built from a dictionary of basis functions, yielding an equivalent linear form with unknown matrices. These unknown matrices are characterized by a set-membership representation derived from noisy input-state data. Using this uncertainty description, we formulate a min-max MPC problem. Two online scenarios are studied: i) when state measurements are noise-free, and, ii) when they are corrupted by process disturbance. For each case, we derive a Lyapunov-based semidefinite program (SDP) to compute a stabilizing state-feedback controller. The resulting schemes are shown to guarantee recursive feasibility and either exponential or robust stability of the closed-loop system depending on whether there is process disturbance. Simulation studies on benchmark examples illustrate the effectiveness and competitive performance of the proposed approach compared to existing data-driven and model-based controllers.
Control Based Enhanced Regenerative Modes for Hydraulic Multi-Actuator Systems
This paper focuses on a control-based approach for enhancing the regenerative capabilities of hydraulic multi-actuator systems using individual metering valves. Thanks to this architecture, pressure and displacement of each actuator can be controlled nearly independently. By determining online, the right pressure to be driven, it enables the optimization of regenerative control strategies for resistive or driving forces. Globally, this control strategy behaves such as a load sensing approach but each metering valve is piloted in order to activate regenerative mode when it is allowable. The main contribution relies on optimizing the pressure to be controlled in each actuator and the main pump in order to maximize the regenerative capacity of a hydraulic machine while following a displacement. The effectiveness of the proposed approach is proved in simulation. Only a single pump line regeneration is explored here but extensions to multi-pump or direct regeneration are also possible.
From Stabilizing Regions to Certified Controllers: Closing the Selection Gap in Unified PID/PI Analysis for Time-Delay Plants
A recent unified treatment of PID tuning for time-delay plants (An, Tang, Sun, Zhang and Chen, Automatica, 2026) combines the D-partition method with a boundary gradient vector (BGV) to orient the boundaries of stabilizing, relative-stability and stability-margin regions. That method answers a feasibility question, namely where admissible gains lie, and it leaves a manual interior-point test to fix the unstable-pole count in each cell, with the choice of a single controller left to the user. This note makes three contributions. First, the one operation the BGV leaves manual, the absolute unstable-pole count, is available analytically: exactly for delay-free designs through a companion-matrix or Routh count, and through an argument-principle (Mikhailov) evaluation for retarded-type delay loops. Labelling every cell with its analytic count removes the interior-point test and decides the whole partition. Second, we add the step the BGV framework cannot reach, a time-domain selection rule that returns one certified controller: among monotone step responses we choose the minimum-settling-time PI gains, characterized by a tangency condition, with monotonicity guaranteed by external positivity (a nonnegative closed-loop impulse response). Third, we flag a neutral-type pitfall that the unified analysis never delimits: an ideal PID with derivative action on a first-order-plus-dead-time (FOPTD) plant is of neutral type, with a root chain on the imaginary axis when k Kd = T. We reproduce the authors' delay-free benchmark exactly, recovering both admissible Kp intervals, and demonstrate the full pipeline on a FOPTD plant, delivering a certified monotone, fast-settling PI controller that the region-only method can neither locate nor justify; the selected gains match an independent closed-form tangency rule to within one percent. All claims are validated numerically.
Safe Packetized Control for Stochastic Constrained Networked Systems
This work develops a formal framework for the synthesis of packetized safety controllers for discrete-time polynomial stochastic networked control systems (dt-PSNCS) operating under communication constraints, including uplink delays (plant-to-controller) and downlink packet losses (controller-to-actuator). In this setting, the controller is deployed remotely and exchanges information with the plant over an imperfect wireless communication network. Our proposed approach treats the downlink channel as an erasure channel, with packet losses characterized by an independent Bernoulli process. To systematically manage both uplink delays and downlink packet loss, we first introduce a buffer collocated with the plant that accommodates the packetized safety control (PSC) mechanism. We augment the plant and buffer states into a unified augmented-state representation that accurately captures the system evolution in the presence of communication imperfections. Our proposed framework synthesizes safety controllers based on control barrier certificates (CBCs), providing probabilistic safety guarantees that remain robust in the presence of both communication delays and packet losses. To achieve this, we reformulate the safety constraints as a sum-of-squares (SOS) optimization program, thereby facilitating the systematic construction of CBCs and their corresponding safety controllers. We validate the proposed framework through three (physical) case studies, demonstrating its effectiveness and practical applicability.
Resilient Substation Design for 500 Year Storm Events Current State of the Art and Challenges for Floodplain Management and Infrastructure Hardening
Electrical substations are increasingly exposed to non-stationary flood hazards and extreme 500-year storm events intensified by climate change. Approximately 15 to 20 percent of U.S. transmission and distribution assets, representing about 12,000 to 16,000 facilities, are located within FEMA 100-year floodplains, with an additional 8,000 situated in 500-year zones. These vulnerabilities have led to catastrophic outages, as demonstrated by Hurricanes Harvey in 2017 and Ida in 2021, resulting in billions of dollars in economic losses and significant grid instability. This paper presents a comprehensive engineering framework for the Resilient Substation of the Future, integrating site elevation, lime and cement-based soil stabilization, articulating concrete blocks (ACBs) for erosion control, and green infrastructure strategies that align with northern American standards such as ASCE 24 Class IV, NERC CIP014, and FEMA Risk MAP standards. The framework extends traditional gray-infrastructure approaches by incorporating flexible ACB revetments that reduce scour, improve stormwater quality, and support LEED sustainability objectives in heat island reduction and habitat restoration. When combined with pozzolanic soil stabilization methods capable of yielding resilient modulus gains of 10 to 20 times and UCS values approaching 600 psi under favorable conditions, the system enhances flood resilience against 0.2 percent annual exceedance probability events. Phased strategies are discussed and include deployable flood barriers, rapid dewatering systems, GIS-enabled microgrids, and HEC-RAS software-based hydroclimatic modeling, collectively reducing lifecycle costs by up to 25 percent while maintaining operational continuity. Through reduced excavation, enhanced aquifer recharge, and self-healing materials, this framework operationalizes resilience as adaptive, sustainable, and cost-effective.
comment: 18 pages, 4 figures, 6 tables
Buildrix: An Open Platform for Sharing and Benchmarking Agentic AI Skills in Building Engineering
Agentic AI offers significant potential to automate complex building-engineering workflows. However, most existing applications remain isolated proof-of-concept demonstrations and lack reusable domain capabilities, human-verified evaluation cases, and standardized benchmarking infrastructure. This study presents Buildrix, an open, community-driven platform for developing, sharing, executing, and evaluating agentic AI skills for building engineering. Buildrix integrates three components: a Python command-line package for developing, validating, publishing, installing, and managing skills and test cases; a web-based Hub for organizing open challenges, reusable skills, test cases, reviews, and benchmark results; and a local agent harness that supports skill discovery, external toolchain provisioning, progressive context loading, and multi-step workflow execution. Buildrix skills are organized as standardized, self-contained packages containing task instructions, executable scripts, dependencies, and supporting resources. Quantitative test cases can be verified by domain experts and promoted to golden test cases for reproducible benchmark evaluation. Buildrix provides an open foundation for reusable capability development, transparent evaluation, and community-driven advancement of agentic AI in building engineering.
Power-Flexible AI Data Centers: A New Paradigm for Grid-Responsive Compute
The rapid expansion of artificial intelligence (AI) infrastructure is driving unprecedented growth in electricity demand from data centers. Traditional power-system planning treats large computing facilities as inflexible peak loads, leading to costly infrastructure upgrades and long delays in grid interconnection. Recent work has shown that AI clusters can reduce electricity consumption during peak demand through software-based workload orchestration. This article explores how modern GPU-based AI data centers can operate as grid-interactive assets that respond dynamically to power system conditions. We describe an architecture integrating grid signals, workload scheduling, and power telemetry for fine-grained cluster power control. Experimental results from a real-world deployment on a 130 kW GPU cluster demonstrate multiple forms of flexibility, including rapid load reduction, sustained curtailment, and carbon-aware operation while preserving service levels for priority jobs. We further demonstrate performance-aware load shifting across geographically distributed clusters, enabling workloads to migrate toward regions with lower grid stress. Together, these capabilities transform AI infrastructure from static electricity consumers into flexible resources that support grid reliability, accelerate interconnection, and improve computing sustainability.
comment: 14 pages, 7 figures, 1 table
Toward Next-Generation AI Data Centers: Power Delivery Architecture Shifts, Emerging Technologies, and Challenges
The rapid growth of AI workloads is driving unprecedented increases in data center power demand, current transients, and thermal stress, exposing fundamental limitations in traditional 48 V rack architectures, low-voltage AC distribution, and line-frequency transformer interfaces. This paper reviews the three stages of architectural shifts required to support next-generation AI data centers and identifies three enabling technological building blocks: high-voltage conversion-ratio DC/DC converters, facility-level low-voltage DC distribution, and medium-voltage solid-state transformers. The advantages, technical challenges, and potential solutions associated with each building block are reviewed. Finally, future research directions and open challenges are discussed.
comment: 36 pages, 25 figures. Submitted to Proceedings of IEEE
Solving Markov Decision Processes with Future Information via MPC
Model Predictive Control (MPC) is widely used in industrial and robotic systems for enforcing constraints and embedding domain knowledge through finite-horizon optimization-based planning. However, despite these strengths, an MPC scheme typically does not yield optimal policies for sequential decision-making problems formulated as Markov Decision Processes (MDPs). Recent combinations of MPC with Reinforcement Learning (RL) alleviate this issue by treating MPC as a parameterized model of the optimal policy of an MDP and adjusting its parameters using data. While these approaches typically consider classical MDPs, many real-world problems include future information--such as forecasts, prices, or reference trajectories--at decision time, which must be included in the MDP state for optimal decision-making. Current MPC-RL approaches do not directly account for this augmented-state structure, raising the question of how to incorporate future information into MPC to obtain an optimal policy. This work establishes the structural requirements under which a parameterized MPC can exactly represent the optimal value functions and policy of an MDP with future information. We further demonstrate that such a parameterized MPC can serve as a structured function approximator, with its parameters learned using RL. The approach is illustrated on a point-mass racing task with future reference information.
comment: 6 pages, accepted to IFAC World Congress 2026
Parallel Dynamic Programming for Conic Linear Quadratic Control
Linear Quadratic (LQ) control problems are at the heart of linear control theory and Model Predictive Control (MPC). While performant, standard approaches to solving such problems are inherently serial, limiting real-time scalability despite the parallel computing power available on modern multi-core CPUs. Contributing to addressing this challenge and motivated by ``divide and conquer'' strategies, we present a parallel-in-time approach that solves computationally demanding conic optimal control problems through the use of the alternating direction method of multipliers (ADMM). In particular, we formulate the inner primal update of ADMM as an LQ problem and split the reformulated problem along the time horizon. This enables us to derive a variant of the Riccati recursion using dynamic programming to solve each subproblem in parallel. Numerical benchmarks on two real-world applications demonstrate as much as a 5x speedup compared to existing related approaches on multi-core CPU hardware.
comment: This paper was accepted for presentation at the IFAC World Congress 2026 (IFAC WC 2026)
Supervised Reinforcement Learning for the Coordination of Distributed Energy Resources SC
The increasing integration of distributed energy resources (DERs) is crucial for power system decarbonization, yet unlocking DERs' flexibility is challenged by their inherent uncertainties and modelling complexity. As traditional optimization methods struggle with such uncertainty and complexity of DERs, reinforcement learning (RL) has emerged as a promising alternative for DER management. However, standard RL methods suffer from sample inefficiency and sub-optimality when trained from scratch. Inspired by the training paradigms in large language models, this paper proposes a Supervised Reinforcement Learning (SRL) framework for learning DER coordination policies. This framework first pre-trains a policy on demonstration data in a supervised-learning fashion, which is then further fine-tuned using RL. Furthermore, we propose a two-step fine-tuning process: offline fine-tuning for enhancing policy performance and online fine-tuning for adapting it to the real-world dynamics. Experiments demonstrate that RL implementations based on the proposed framework significantly outperform all benchmarks, achieving high cost efficiency even under low-quality demonstration data.
comment: Presented at PSCC2026
TurboMPC: Fast, Scalable, and Differentiable Model Predictive Control on the GPU
Robotics increasingly relies on GPUs for parallel simulation, large-scale learning, and neural-network inference. For model predictive control (MPC) to scale with this paradigm, solvers must run efficiently on this hardware while remaining fast, differentiable, and compatible with expressive MPC formulations used in robotics. We present TurboMPC, a differentiable MPC solver that runs entirely on the GPU and supports state and control inequality constraints, implicit integrators, cross-time-coupled costs, and slack variables. TurboMPC combines sequential quadratic programming (SQP), an alternating direction method of multipliers (ADMM) inner solver, implicit differentiation, and a co-designed JAX-CUDA implementation for efficiency and ease of use. In simulation, we validate TurboMPC on constrained planning, humanoid imitation learning, and reinforcement learning with neural-network cost function tasks, achieving up to $15\times$ and $58\times$ speedups over state-of-the-art CPU and GPU differentiable solvers, respectively. We deploy TurboMPC on a full-scale car for minimum-time racing and find that batched, GPU-accelerated tuning of MPC parameters via Bayesian optimization yields significantly faster driving than a hand-tuned baseline. TurboMPC also scales to planning horizons of over $8000$ knot points while maintaining control of the vehicle. We open-source TurboMPC at: https://github.com/ToyotaResearchInstitute/turbompc
Regret-Guaranteed Safe Switching: LQR Setting with Unknown Dynamics
We consider learning-based control in LQR setting, where the parameters associated with each mode are a priori unknown. The next mode to be activated is revealed online only at the time of switching. The objective is to determine both the switching times and the control gains for each mode such that (1) the norm of the system state remains bounded according to a prescribed criterion, and (2) the accumulated cost is minimized. To formalize the state-norm requirement, we introduce the notion of $(α,β)$-controllability for given parameters $α$ and $β$. We first study the problem in a known model setting and show that, under the switching mechanism described above and under the assumption that each mode is visited infinitely often, the strategy that minimizes the average expected cost consists of applying, in each mode, the feedback gain obtained from the solution of the discrete algebraic Riccati equation, while selecting dwell times that sufficiently satisfy the controllability condition. We refer to this strategy as the benchmark policy. Next, we propose an algorithm for the unknown-model setting that minimizes the regret, defined as the difference between the cumulative cost incurred by the online algorithm and that of the offline benchmark. By accurately estimating dwell-time errors, our method achieves an expected regret of $\mathcal{O}(|\mathcal{M}|^{1/4} n_s^{3/4} + n_m)$, where $n_s$ denotes the number of switches, $|\mathcal{M}|$ is the number of modes, and $n_m$ is the number of malignant switches.
Exploring Uncertainty Propagation in Coupled Hydrologic and Hydrodynamic Systems via Distribution-Agnostic State Space Analysis
Accurate overland runoff and infiltration predictions are critical for effective water resources management, in particular for urban flood management. However, the inherent uncertainty in rainfall patterns, soil properties, and initial conditions makes reliable flood forecasting a challenging task. This paper presents a framework for quantifying the impact of these uncertainties on hydrologic and hydrodynamic simulations via a state space approach based on a differential algebraic equation (DAE) formulation that couples surface and subsurface constraints with the governing dynamics. Under this formulation, the complex interactions between overland flow and infiltration dynamics are captured in realtime. To account for uncertainty in inputs and parameters, the proposed framework quantifies and propagates these uncertainties through the DAE model formulation under partial measurements. The effectiveness of the approach is demonstrated through a series of numerical experiments on synthetic and real world catchments, highlighting its ability to provide probabilistic estimates of watershed state conditions while accounting for uncertainty. An important aspect of the proposed methods is that they are distribution-agnostic, i.e., they only require covariances of uncertainty and not specific types of distributions. The proposed framework is further validated against Monte Carlo (MC) ensemble simulations while providing probabilistic state estimates for measured and unmeasured watershed states under partial gauging.
Policy Gradient with Self-Attention for Model-Free Distributed Nonlinear Multi-Agent Games IROS 2026
Multi-agent games in dynamic nonlinear settings are challenging due to the time-varying interactions among the agents and the non-stationarity of the (potential) Nash equilibria. In this paper we consider model-free games, where agent transitions and costs are observed without knowledge of the transition and cost functions that generate them. We propose a novel distributed policy structure that follows the communication constraints in multi-team games, with multiple agents per team, and learned through policy gradients. Our formulation is inspired by the structure of distributed policies in linear quadratic games, which take the form of time-varying linear feedback gains. In the nonlinear case, we model the policies as nonlinear feedback gains, parameterized by self-attention layers to account for the time-varying multi-agent communication topology. We demonstrate that our approach achieves strong performance in several settings, including distributed linear and nonlinear regulation, and simulated and real multi-robot pursuit-and-evasion games.
comment: The paper has been accepted and will be presented at IEEE/RSJ IROS 2026
A Multi-Worker Assembly Line Rebalancing with Spatial and Ergonomic Considerations
This work addresses the Assembly Line Rebalancing Problem driven by cycle-time changes in manual assembly systems where multiple workers operate in parallel within the same station. A multi-objective optimization model is proposed that incorporates task reassignment, worker allocation, ergonomic evaluation, and explicit spatial feasibility through work-area constraints. The formulation minimizes deviations from the current configuration while promoting balanced workload and ergonomic conditions among workers. The main contribution is the extension of assembly line rebalancing to multi-worker settings with explicit spatial constraints. Computational experiments on synthetic instances demonstrate that the model consistently generates feasible reconfigurations, highlighting its potential as a decision-support tool for industrial rebalancing in flexible production environments.
TactileReflex: Noise-Statistics-Driven Vision-Tactile Reflex Control for Force-Sensitive Manipulation
Manipulating fragile deformable containers, such as disposable plastic cups filled with liquid, demands real-time grip-force adaptation within an extremely narrow force margin: insufficient force causes slip, while excessive force irreversibly deforms the thin wall. Existing approaches struggle to achieve such force-sensitive manipulation tasks. We propose a noise-statistics-based calibration-driven reflex control paradigm with vision-based tactile sensing: by analyzing the sensor's intrinsic noise characteristics (via a brief static-hold-and-unload protocol), we directly derive all controller thresholds, eliminating external force calibration, trial-and-error manual tuning, or material-specific physical models. Instantiating this paradigm, we present TactileReflex, a three-channel closed-loop controller that extracts three image-level proxies, shear intensity ($S_y$), contact intensity ($F_n$), and center of pressure ($C$), from dual visuo-tactile sensors and drives prioritized reflex channels at ~12 Hz for slip suppression, weight-adaptive release, and force protection. Each channel closes the loop directly on its proxy via noise-derived thresholds. Ablation demonstrates that only the full three-channel system is able to prevent irreversible container deformation (5/5 success vs. at most 1/5 for partial configurations). In a dynamic pouring task, fixed-effort baselines fail in all 10 attempts due to pose drift, while TactileReflex achieves 9/10 success across two water volumes. As a self-contained and interpretable controller, TactileReflex can serve as a plug-and-play safety layer beneath high-level manipulation pipelines, including haptic-free VR teleoperation and vision-language-action (VLA) policies.
comment: 8 pages, 4 figures, 6 tables
Restricting Voltage Deviation of DC Microgrids with Critical and Ordinary Nodes: A Generalized Consensus Approach
Restricting bus voltage deviation is crucial for normal operation of multi-bus DC microgrids, yet it has received insufficient attention due to the conflict between two main control objectives in DC microgrids, i.e., voltage regulation and current sharing. By revealing a necessary and sufficient condition for achieving these two objectives, this paper proposes a novel consensus-based current sharing control law that can achieve the compromised control objective, balancing both current sharing and voltage deviation restriction. Additionally, we examine the effectiveness of the proposed control scheme for DC Microgrids that include both critical nodes and ordinary nodes, where there is a simultaneous requirement for voltage deviation limits on critical nodes and accurate current sharing among ordinary nodes. Theoretical results are verified by simulations, and the effectiveness in handling plug-and-play operations of distributed generators is also illustrated.
EMFusion: Uncertainty-Aware Conditional Diffusion Model for Multivariate Narrow-band Exposure Forecasting
The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, multivariate narrow-band EMF forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional diffusion-based EMF forecasting framework that integrates diverse contextual factors, such as time of day, season, and holidays, while providing uncertainty-aware probabilistic forecasts. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates empirical probabilistic prediction intervals from the learned conditional distribution, providing uncertainty-aware probabilistic forecasting rather than simple point estimation. Numerical experiments conducted on the multivariate narrow-band EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. The proposed EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS) and 13.93% in normalized root mean square error.
comment: Accepted in IEEE Transactions on Network Science and Engineering
Feedback stabilization of switched systems under arbitrary switching: A convex characterization
In this paper, we study stabilizability of discrete-time switched linear systems where the switching signal is considered as an arbitrary external input (and not a control variable). We characterize feedback stabilization via a hierarchy of necessary and sufficient linear matrix inequalities (LMIs) conditions based on novel graph structures. We analyze both the cases in which the controller has (or has not) access to the current switching mode, the so-called mode-dependent and mode-independent settings, providing specular results. Moreover, our approach provides explicit piecewise-linear and memory-dependent linear controllers, highlighting the connections with existing stabilization approaches. The effectiveness of the proposed technique is finally illustrated with the help of some numerical examples.
comment: Accepted for publication at Automatica
AI-Driven Predictive Maintenance with Environmental Context Integration for Connected Vehicles: Simulation, Benchmarking, and Field Validation
Predictive maintenance for connected vehicles offers the potential to reduce unexpected breakdowns and improve fleet reliability, but most existing systems rely exclusively on internal diagnostic signals and are validated on simulated or industrial benchmark data. This paper presents a contextual data fusion framework integrating vehicle-internal sensor streams with external environmental signals -- road quality, weather, traffic density, and driver behaviour -- acquired via V2X communication and third-party APIs, with inference at the vehicle edge. The framework is evaluated across four layers. A feature group ablation study on a physics-informed synthetic dataset shows contextual features contribute a 2.6-point F1 improvement; removing all context reduces macro F1 from 0.855 to 0.807. On the AI4I 2020 benchmark (10,000 samples), LightGBM achieves AUC-ROC 0.973 under 5-fold stratified cross-validation with SMOTE confined to training folds. A noise sensitivity analysis shows macro F1 remains above 0.88 at low noise and degrades to 0.74 at high noise. Most critically, the pipeline is validated on real-world telemetry from five vehicles across three countries (India, Germany, Brazil), comprising 992 trips and 11 evaluable service events identified from component wear resets in the trip logs. Across six wear-driven events spanning four vehicles, the model achieves 100% detection with mean MAE of 12.2 days. A fine-tuning ablation shows the base synthetic model already achieves 6/6 binary detection; per-vehicle adaptation reduces wear-driven MAE from 25.9 to 12.2 days. SHAP analysis confirms contextual and interaction features rank among the top 15 predictors. Edge-based inference reduces estimated latency from 3.5 seconds to under 1.0 second relative to cloud-only processing.
Asymptotic Stability of Conservative Convex-Combination Dynamics on Multilayer Graphs
We study discrete-time consensus dynamics on multilayer networks in which each layer evolves via a time-varying doubly stochastic interaction matrix, and inter-layer coupling is introduced through two mechanisms: (i) distribute-then-average and (ii) average-then-distribute. These define conservative redistribution processes that preserve total mass across all layers and can be viewed as stochastic averaging driven by products of time-inhomogeneous stochastic matrices with structured coupling. For both mechanisms, we construct quadratic Lyapunov functionals that form nonnegative supermartingales, yielding almost sure convergence. The analysis combines martingale arguments with dissipation identities and connectivity properties of induced interaction graphs. Under recurrent connectivity conditions on subgraphs of the time-varying interaction structure, we prove asymptotic consensus to the global average determined by the initial total mass. This provides a unified framework for multilayer averaging dynamics, extending classical consensus results for products of stochastic matrices to settings with explicit inter-layer coupling. As corollaries, we specialize the general framework to the multilayer garbage disposal dynamics, thereby establishing convergence guarantees under natural connectivity conditions on the underlying graphs.
comment: 17 pages, 4 figures
Simple and Combination Parametric Resonances of an Electromagnetically Suspended Vehicle subject to Base Excitation
This paper investigates the dynamic stability of an electromagnetically suspended vehicle, encountered in Hyperloop and Maglev systems, subject to periodic excitations caused by surface irregularities or vibration of the support induced by external noise. The narrow clearance between the vehicle and the support can make it highly sensitive to small oscillations, since the admissible amplitudes of the vehicle oscillations can be comparable to external excitation amplitude. The vehicle is modelled as a three-degree-of-freedom model where the vehicle is suspended via two identical electromagnetic actuators from a rigid support that oscillates. The governing equations are derived using force and torque balances, incorporating nonlinear electromagnetic forces, and Kirchhoffs law for the electromagnets with PD control strategy on the airgap. The equations of motion are linearized around the steady state induced by the surface oscillation, yielding a system with time-periodic coefficients. We analytically explore both principal and combination parametric resonances using an extended Hills method, and Floquet theory is used for numerical validation. The stability boundaries are obtained as ellipses in control gain parameter space, and the influence of system parameters on these boundaries is characterized. For the principal parametric resonance, the ratio of the sizes of the two obtained ellipses is three to one, whereas for the combination parametric resonance, the ratio is fourteen to one. When all ellipses are simultaneously present, one of the ellipses associated with the combination parametric resonance is the largest.
Channel Estimation under Large Doppler Shifts and Channel Aging in NOMA-Based Air-Ground Communications
This paper investigates a multiple antenna system with non-orthogonal multiple access (NOMA) for the exchange of air traffic management data between commercial aircraft pilots and ground-based air traffic controllers. While NOMA techniques enhance spectral efficiency, their application to aircraft communications is challenged by the high speed of the aircraft (up to 214 m/s) and the long communication ranges (up to 250 km), resulting in significant Doppler shifts and low signal-to-noise ratios, respectively. To accurately assess these challenges, we employ a realistic geometry-based stochastic air-ground channel model, derived from dedicated flight measurement campaigns. In this paper, multiple aircraft simultaneously transmit data to the ground station. We focus on the channel estimation problem at the ground station under high carrier frequency offsets and the effects of channel aging due to channel's time-varying nature. For the channel estimation problem, we compare the Zadoff-Chu sequences with time-division approach under varying carrier frequency offset pre-compensation accuracies at the aircraft transmitter. For the channel aging problem and performance evaluation of channel estimators, we compute the outage probability for both the zero-forcing detector and the minimum mean squared error detector with successive interference cancellation. The results show that the favorable channel estimator-detector combinations differ between the takeoff & landing phase and the enroute cruise phase of the flight, due to the distinct channel propagation characteristics of each phase.
comment: 7 pages, 3 Figures; Accepted to the 2026 IEEE 104th Vehicular Technology Conference (VTC2026-Fall), Boston, MA, USA, September 2026
Technology configurations for decarbonizing residential heat supply through district heating and implications for the electricity network
District heating networks (DHNs) have significant potential to decarbonize residential heating and accelerate the energy transition. However, designing carbon-neutral DHNs requires balancing several objectives, including economic costs, social acceptance, long-term uncertainties, and grid-integration challenges arising from electrification. By combining modeling-to-generate-alternatives with power flow simulation techniques, we develop a decision-support method for designing carbon-neutral DHNs that are cost-effective, socially acceptable, and impose minimal impacts on the electricity grid. Applying our method to a Dutch case, we find substantial diversity in how carbon-neutral DHNs can be designed. The flexibility in technology choice, sizing, and location enables accommodating different real-world needs and achieving high electrification levels without increasing grid loading. For instance, intelligently located heat pumps and thermal storage can limit grid stress even when renewable baseload heat sources and green-fuel boilers are scarce. Using our method, planners can explore diverse carbon-neutral DHN designs and identify the design that best balances stakeholders' preferences.
On the Flexibility Potential of a Swiss Distribution Grid: Opportunities and Limitations
The growing integration of distributed renewable generation and the electrification of heating and transportation are rapidly increasing the number of flexible devices within modern distribution grids. Leveraging the aggregated flexibility of these small-scale distributed resources is essential to maintaining future grid-wide stability. This work uses the Swiss distribution grid of Walenstadt as a case study to provide insights into the aggregated flexibility potential of distribution grids. It demonstrates that incorporating devices such as heat pumps and photovoltaic systems significantly enhances distribution grid flexibility. It investigates the time-varying nature of aggregated flexibility and highlights how it can vary seasonally. Furthermore, simulations of future scenarios reveal that aggregated flexibility does not increase linearly or monotonically with higher levels of flexible device penetration. This is primarily due to the overloading of individual feeders, which underscores the impact of grid topology and network constraints on the aggregated flexibility potential.
TIP-Search: Time-Predictable Inference Scheduling for Market Prediction under Uncertain Load
Real-time market prediction services need correct predictions before a decision deadline; a correct prediction delivered late is not usable. TIP-Search studies time-predictable inference scheduling over fixed market predictors under uncertain load. It filters conformal latency-quantile feasible models, dispatches over finite workers, and uses shielded constrained online experts to trade accuracy, queue pressure, and deadline risk. On the optimized deployable pool, TIP-Search reaches 0.994 raw accuracy and 0.991 timely accuracy. On official TLOB FI-2010 h=10, TIP-Search++ raises timely accuracy from 0.156 to 0.239 and deadline satisfaction from 0.391 to 0.962. In matched h10 profiled systems replay, OCO-ACPO reaches 0.303 timely accuracy and 0.951 deadline satisfaction, with paired gains over RAMSIS/SneakPeek/utility-style comparators of $+0.00285$ timely accuracy ($p=0.0118$) and $+0.0146$ deadline satisfaction ($p=1.5{\times}10^{-5}$). SA-OCO-ACPO improves timely/deadline service by 0.188--0.417 over CPO under nonstationary stress. The claim is a systems scheduling result, not a broad LOB classifier leaderboard.
FALCON: Transforming Cyber Threat Intelligence into Deployable IDS Rules with Self-Reflection
Signature-based Intrusion Detection Systems (IDS) detect malicious activity by matching network or host events against predefined rules. Security analysts manually develop these rules from Cyber Threat Intelligence (CTI). As threats evolve, this manual pipeline faces two bottlenecks. Before authoring a new rule, an analyst must reconcile the incoming CTI with the existing rule base and determine whether to create, update, or retire one. This process is challenging due to the representational differences between the CTI and Rule formats. This gap limits the effectiveness of keyword- and embedding-based search, making rule reconciliation cognitively demanding and, in turn, contributing to "rule bloat". Second, automated verification of a new rule is inherently difficult as zero-day threats lack ground truth from simulated testing. Hence, standard metrics cannot prove that a rule semantically adheres to the CTI, and the use of LLMs leads to non-deterministic behavior. To address these challenges, we introduce FALCON, an agentic framework for CTI-grounded rule retrieval, generation, and validation. At its core, a novel CTI-Rule semantic scorer, quantifies the functional alignment between a CTI and a rule; the same signal drives a retriever that surfaces relevant deployed rules and a ground-truth-free validator that scores generated ones. Around it, a generation pipeline produces deployable rules from CTI in real time and refines them through self-reflective syntactic, semantic, and performance validators. Across network (Snort) and host-based (YARA) platforms on a purpose-built CTI-Rule dataset, FALCON attains a mean relevance of 0.72 (approx), with 84% inter-rater agreement among cybersecurity analysts, underscoring the promise of real-time security automation.
comment: 17 pages, 10 figures, 8 tables
Generating adversarial inputs for a graph neural network model of AC power flow
This work formulates and solves optimization problems to generate input points that yield high errors between a neural network's predicted AC power flow solution and solutions to the AC power flow equations. We demonstrate this capability on an instance of the CANOS-PF graph neural network model, as implemented by the PF$Δ$ benchmark library, operating on a 14-bus test grid. Generated adversarial points yield errors as large as 3.7 per-unit in reactive power and 0.08 per-unit in voltage magnitude. When minimizing the perturbation from a training point necessary to satisfy adversarial constraints, we find that the constraints can be met with as little as an 0.04 per-unit perturbation in voltage magnitude on a single bus. This work motivates the development of rigorous verification and robust training methods for neural network surrogate models of AC power flow.
From Net Load Modifiers to Firm Capacity: The Role of Distributed Energy Resources in Resource Adequacy
Distributed energy resources (DERs) such as rooftop solar, batteries, demand response, and electric vehicles can contribute to power system reliability, yet their performance is difficult to translate into firm resource adequacy (RA) capacity across jurisdictions. Existing analyses often locate this difficulty within individual technical requirements, such as metering, accreditation, or dispatch performance, but give less attention to how constraints at one stage carry over to the next. This review traces the RA participation pathway through five stages: load forecasting, registration and classification, metering and verification, capacity accreditation, and performance obligations. We synthesize literature, tariffs, market manuals, and regulatory documents from California, PJM, ISO-NE, Great Britain, and Ireland, spanning U.S. capacity markets and European capacity remuneration mechanisms. Across these frameworks, similar barriers recur despite different procurement models and regulatory structures, indicating that participation is constrained by cross-stage design, not jurisdiction-specific rules alone. We identify three cross-stage couplings through which capacity value is lost between stages: mismatches between resource classification and operational obligations, weak links between verification evidence and accreditation, and temporal misalignment between planning forecasts and scarcity-hour performance. The central finding is that compliance architecture, not DER technology alone, is often the binding constraint on translating DER capability into firm RA contributions. This points to reforms that codify cross-stage information handoffs, tie accreditation to auditable verification evidence, and refresh capacity values as deployment changes system conditions. Rather than adjusting individual stages in isolation, RA reform should redesign the participation pathway end-to-end.
Symmetric Linear Dynamical Systems are Learnable from Few Observations
We consider the problem of learning the parameters of a $N$-dimensional stochastic linear dynamics under both full and partial observations from a single trajectory of time $T$. We introduce and analyze a new estimator that achieves a small maximum element-wise error on the recovery of symmetric dynamic matrices using only $T=\mathcal{O}(\log N)$ observations, irrespective of whether the matrix is sparse or dense. This estimator is based on the method of moments and does not rely on problem-specific regularization. This is especially important for applications such as structure discovery.
Linear Lyapunov Functions for Nonlinear Compartmental Systems
This technical note examines exponential stability of the null solution to a large class of compartmental systems governed by ordinary differential equations. Sufficient conditions under which these systems admit a linear Lyapunov function are provided. The coefficients of the Lyapunov functions and the exponential decay rate they yield are obtained from an eigenvalue problem. For a special case of the system class considered, we derive an equivalence between attractivity of the null solution and the existence of a linear Lyapunov function.
A Lyapunov-Based Small-Gain Theorem for Fixed-Time ISS: Theory, Optimization, and Games
We develop a Lyapunov-based small-gain theorem for establishing fixed-time input-to-state stability (FxT-ISS) guarantees in interconnected nonlinear dynamical systems. The proposed framework considers interconnections in which each subsystem admits a FxT-ISS Lyapunov function, providing robustness with respect to external inputs. We show that, under an appropriate nonlinear small-gain condition, the overall interconnected system inherits the FxT-ISS property. In this sense, the proposed result complements existing Lyapunov-based smallgain theorems for asymptotic and finite-time stability, and enables a systematic analysis of interconnection structures exhibiting fixed-time stability. To illustrate the applicability of the theory, we study feedback-based optimization problems with time-varying cost functions, and Nash-equilibrium seeking for noncooperative games with nonlinear dynamical plants in the loop. For both problems, we present a class of non-smooth gradient or pseudogradient-based controllers that achieve fixed-time convergence without requiring time-scale separation and using real-time feedback. Numerical examples are provided to validate the theoretical findings.
Robotics
AutoDex: An Automated Real-World System for Dexterous Grasping Data Collection
Learning robust dexterous grasping requires real-world data that records the physical outcomes of grasp attempts. Such data is hard to obtain at scale: teleoperation yields valid physical outcomes but is slow and operator-biased, while simulation-based generation is cheap and scalable but cannot certify contact validity. A natural solution is to generate candidate grasps and verify them on real hardware, but this scales only if the entire collection loop (perception, execution, labeling, and reset) runs without human intervention. We present AutoDex, an automated real-world data-collection system that closes this loop: for each candidate from a replaceable generator, it localizes the object under severe hand-object occlusion with dense 20-camera perception, executes collision-monitored robot motions, labels lift-and-hold success or failure, and actively resets the object between trials to expose additional candidates across stable poses. The result is a reusable database of physically labeled grasp trials that downstream systems can query by retrieval and feasibility filtering. Using AutoDex, we collect 3,593 grasp trials across Allegro and Inspire hands on 100 diverse objects, with synchronized multi-view observations and robot-state logs. For a matched 500-trajectory collection, AutoDex requires 10.3 h versus 49.4 h for teleoperation, yielding a 4.8x throughput improvement, and grasps retrieved from the AutoDex-validated database succeed 76% versus 34% for simulation-only validation. Code and data will be publicly released.
comment: 16 pages, 9 figures. Includes supplementary material
LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models ECCV 2026
Despite the impressive manipulation capabilities of Vision-Language-Action (VLA) models, their operational safety under strict constraints remains largely unverified. To address this, we introduce a parametric safety benchmark to procedurally generate safety-critical scenarios with comprehensive stochasticity. To overcome the scalability bottlenecks of human teleoperation, we develop a novel keypose-driven data generation pipeline. Leveraging this infrastructure, we curate a large-scale dataset of 19,664 strictly collision-free demonstrations with extensive domain randomization. We then conduct a systematic cross-paradigm evaluation of eight VLA and two embodied foundation models. Our analysis reveals a critical generalization-safety tension: although high-diversity training fosters safer trajectories, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment. By providing a scalable pipeline, a robust dataset, and profound failure-mode insights, LIBERO-Safety establishes a crucial foundation for developing safe and reliable VLA models.
comment: Accepted by ECCV 2026, Project Page: https://libero-safety.github.io/
LaST-HD: Learning Latent Physical Reasoning from Scalable Human Data for Robot Manipulation
Human-hand demonstrations provide a direct and scalable source of physical interaction data for robot learning. While manual retargeting is indispensable for establishing kinematic action correspondence across different morphologies, robust transfer requires going beyond geometry to address the underlying alignment of physical dynamics between human and robot manipulation. To address this, we introduce LaST-HD, a novel human-to-robot action learning paradigm that extends reasoning-before-acting VLA by aligning human-hand and robot demonstrations in a shared latent reasoning space. Rather than mimicking human kinematics, LaST-HD trains an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to synthesize unified latent targets. After aligning cross-embodiment representations in this shared forward-dynamics space, these targets supervise LaST-HD's latent reasoning process, enabling it to internalize shared physical dynamics and drive efficient human-hand action learning. Moreover, we develop Out-of-Lab (OOL) Glove, a low-cost motion-capture glove tailored to LaST-HD for human-hand data collection. The captured human data provide precise keypoints and serve as universal action supervision across grippers and dexterous hands. Armed with the aligned latent space and high-fidelity human-hand data, we develop a progressive mixed-to-human training recipe comprising mixed human-robot co-training and human-hand online correction post-training. Through mixed co-training, LaST-HD improves generalization to novel objects, scenes, and positions using only human-hand demonstrations. With online correction, LaST-HD further adapts to novel environments and achieves over 90\% accuracy using only 20 minutes of OOL glove data.
CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation
Humanoid loco-manipulation is often simplified into a stop-and-go process: walking to an object, stopping to manipulate it, and then resuming locomotion. It also commonly relies on low degree-of-freedom (DoF) end effectors that behave like an open-close grasp primitive. We introduce CoorDex, a learning pipeline that converts high-dimensional body and dexterous hand control into coordinated latent residual control, enabling high-DoF dexterous loco-manipulation on the move. Starting from simulated whole-body and hand demonstrations, CoorDex trains privileged motion tracking teachers for the humanoid body and dexterous hand, distills them into proprioception-conditioned latent priors, and uses the frozen priors as the action space for downstream residual reinforcement learning. A coordinated latent residual policy composes these priors through shared task context and separate body-hand residual heads, preserving natural whole-body motion while improving finger-level contact reliability. CoorDex enables a Unitree G1 humanoid with a 20-DoF WUJI hand to execute dexterous manipulation while in motion, including non-stop bottle grasping and carrying, fridge door opening on the move, and cube pick-and-turn. Ablations on the walk-grasp-carry task show that joint-space PPO, joint-space hand control, and monolithic latent prediction all fail under the same reward budget, while the latent-prior interface and coordinated residual structure make high-dimensional contact-rich loco-manipulation trainable. Project Page: https://skevinci.github.io/coordex/
comment: Project page: https://skevinci.github.io/coordex/
A Reduced Order Model for Emergent Mechanics in Woven Systems
Woven structures exhibit rich mechanical behaviors including anisotropic stiffness, shear-induced locking, and crimp interchange that emerge purely from the geometric arrangement of individual weavers rather than from constituent material properties. Existing models either homogenize these interactions or resolve them at prohibitive computational cost. We introduce a reduced-order model that bridges this gap by representing individual weaver interactions through a system of nodes and four physically interpretable stiffness elements capturing axial deformation, in-plane uncrimping, inter-weaver shear, and frictional slip. Eigenvalue analysis of the unit cell confirms that the lowest-energy deformation modes correspond directly to known weave-specific phenomena, and that each element is necessary for a complete kinematic and mechanistic description. Element stiffness parameters are calibrated against empirical three-point bending and shear data, achieving agreement within 5% across varied weaver widths and spacings. The validated model is then applied to demonstrate capabilities beyond the reach of continuum approaches including: the emergent Poisson's response arising from crimp interchange, stepwise force reduction during progressive weaver pullout, stress localization under three distinct tearing configurations, and programmable mechanical anisotropy through spatially graded weaver stiffness. The physical transparency and computational efficiency of the framework position it as a practical tool for the analysis and design of woven architected materials with programmable mechanical response.
Flatness Preserves Instruction Following in Vision-Language-Action Models
Vision-language-action (VLA) models have the potential for open-world generalization by leveraging pretrained vision-language representations, yet downstream finetuning on limited robot data often degrades these representations, leading to brittle policies that ignore language instructions in favor of visual shortcuts, a failure mode we term instruction blindness. We hypothesize that standard finetuning with limited data applies gradients to a sparse set of points, which manifests as a sharp loss landscape with high-curvature minima. We propose to address this directly through flatness-preserving optimization while finetuning on the exact same data, where learning a flatter landscape results in a model more robust to perturbations in the weight space. Specifically, we demonstrate that simply applying sharpness-aware minimization during VLA finetuning significantly improves instruction following by over 60% across multiple simulation and real-world benchmarks without additional data, architectural modification, or retraining. We further analyze the effect of selective sharpness, quantify its effects, and show that our approach is complementary to existing guidance techniques. Project page can be found at https://haochenz11.github.io/papers/flatness-vla/.
Learning Process Rewards via Success Visitation Matching for Efficient RL
In many modern applications of reinforcement learning (RL), the natural reward for a task of interest is inherently sparse: a reward of 0 is given everywhere except when the task is completed, when a reward of +1 is given. Training a policy to maximize such a sparse reward requires solving a challenging credit assignment problem, leading to slow or ineffective RL improvement. We propose a simple approach to transform a sparse outcome reward into a dense process reward. Our approach relies on training a discriminator to distinguish between previous successful and unsuccessful episodes, and using this discriminator to incentivize the RL-learned policy to match the state-action visitations of successful episodes, while avoiding those of unsuccessful episodes. By incentivizing the policy to match the visitations over all states, not just those that correspond to task success, this reward provides dense feedback on whether progress is being made towards task completion, and, we show, provably achieves this without changing the optimal policy. Focusing on finetuning of robotic control policies, we demonstrate that our approach leads to significantly faster RL finetuning performance on both simulated and real-world manipulation tasks, as compared to simply maximizing the sparse outcome reward.
Learning to See While Learning to Act: Diffusion Models for Active Perception in Robot Imitation
Most imitation learning methods assume full observability in table-top settings. In practice, objects are often occluded, requiring robots to both search and act, and learning this coupled behavior from limited demonstrations remains challenging. We propose See2Act, an imitation learning approach that conditions action prediction on a sequence of actively-inferred viewpoints at test time, by coupling action denoising with viewpoint refinement. The policy is trained using camera poses anchored to keyframe actions from offline demonstrations, enabling implicit learning of where to see, while learning how to act. We empirically demonstrate that in Ravens the policy recovers informative viewpoints under severe occlusions, and on RLBench tasks it improves performance by up to 34% over prior methods. In the real world, we collect 50 demonstrations in a digital twin and achieve zero-shot sim-to-real transfer on pick-and-place tasks using depth observations. The policy handles significant occlusions, showing that learned viewpoint reasoning enables robust manipulation under partial observability.
comment: Project website: see2act.github.io
dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models
Vision-Language-Action (VLA) models have established a powerful paradigm for generalist robotic manipulation by grounding control into the semantic reasoning of VLMs. Prevailing architectures typically model actions continuously via diffusion or flow processes, or discretely through either autoregressive generation or parallel decoding. Recently, Discrete Diffusion VLAs (dVLAs) have emerged as a distinct alternative, unifying vision, language, and action into a single discrete token space via masked generative modeling. While combining iterative refinement with unified representations, its training has thus far been restricted to Supervised Fine-Tuning (SFT), leaving the potential of Reinforcement Learning (RL) for further policy refinement largely unexplored. A fundamental challenge in RL for dVLAs is that the marginal probability of the final action generated by dVLAs remains intractable. To solve this problem, we propose \textbf{dVLA-RL}, shifting the learning objective from the marginal action probability to the joint probability of the sampled generation path. Specifically, by modeling the denoising process as a Markov Decision Process (MDP), we mathematically formulate this path probability as a product of step-wise transitions. This trajectory-level objective provides a unified formulation that natively accommodates variable denoising steps. Leveraging this intrinsic fexibility, we introduce a unified step scheduling approach for complex multi-task learning, tailoring denoising steps to specific task complexities to maximize both success rates and computational effciency. Extensive evaluations demonstrate that our approach achieves a success rate of \textbf{99.7\%} on LIBERO. Furthermore, it establishes strong VLA-based results on RoboTwin 2.0 by delivering a \textbf{30.6\%} improvement over the SFT baseline, remaining competitive with strong World-Action Model baselines.
RECALL: Recovery Experience Collection for Active Lifelong Learning in Vision-Language-Action Models
Vision-Language-Action (VLA) models are commonly fine-tuned through passive imitation learning, where additional demonstrations are collected for tasks where the policy performs poorly. This approach incurs several downsides: it requires the robot to fail before data collection is triggered, provides little guidance about which states require supervision, and wastes demonstrator effort on redundant parts of the task where the policy already performs well. In this paper, we propose an active, continual learning paradigm for VLAs. We demonstrate that active, uncertainty-guided data collection leads to more efficient fine-tuning than when using passively-collected demonstrations. However, we also find that fine-tuning only on actively-collected recovery data leads to catastrophic forgetting. We evaluate techniques for continual learning, including replay-based data mixing and elastic weight consolidation, and identify tradeoffs between plasticity to uncertainty-guided recovery data and retention of previously learned behaviors. Overall, our work contributes an empirical study of active continual learning for autoregressive VLAs, establishing that uncertainty-guided recovery demonstrations can improve adaptation efficiency while also revealing open challenges when targeted new data is incorporated into large robot policies.
Autonomous Subsea Cable Search and Tracking with Graph-Optimised Priors and Visual Tracking
Global communications rely on subsea cable infrastructure that remains vulnerable to damage from natural hazards and human activity. Autonomous underwater vehicles (AUVs) offer an efficient means to inspect long sections of exposed cable, but uncertainty in cable route maps, small cable diameters and partial burial makes continuous tracking a challenge. This paper presents a novel cable search and tracking method that leverages uncertain prior cable route maps. Graph-based optimisation continuously update the cable route to remain consistent with visual observations. Route uncertainty is constrained as a function of distance from observations using physics-based catenary models that account for cable parameters (i.e., lay depth, diameter, and density), bounding the search space to physically feasible regions and improving search efficiency. Cable detection is performed using a semi-supervised classifier running in real-time on-board a camera-equipped AUV. These detections both update the graph-based optimisation and enable visual cable tracking. When tracking is lost due to misclassification, burial or imperfect control, the bounded search space enables efficient recovery. The approach was demonstrated in field trials using the University of Southampton's Smarty200 AUV. The system successfully located the cable despite deliberate errors in it initial cable route map, updating this to be consistent with observations and using visual tracking to inspect up to 59% of a 120m test cable, with successful recovered after tracking loss.
Real-Time Multimodal Activity-Aware Error Detection in Robot-Assisted Surgery
Robot-assisted minimally invasive surgery improves surgical precision but introduces complexity, making technical error detection essential for ensuring patient safety. Current executional error detection methods using video data often overlook fine-grained contextual descriptions of activities and error types within the hierarchical structure of surgical procedures. They also under-utilize complementary multimodal information. We propose a unified framework for executional error detection that leverages multimodal input, including video, kinematics, and descriptive textual prompts. Through activity prompting, we integrate descriptive language in gesture-level activities, instrument-object interactions, and error definitions. We also introduce activity-aware visual embeddings derived from vision encoders pretrained on surgical activity labels to compare the effectiveness of contrastive language-image embeddings with traditional image-based embeddings for error detection. By seamlessly integrating kinematic data with video and textual modalities, our framework significantly improves error detection performance. Achieving up to 5\% and 16.6\% F1 score improvements over state-of-the-art baselines on the JIGSAWS and SAR-RARP50 datasets, respectively, we demonstrate the value of combining curated textual prompts with multimodal data for accurate error detection.
comment: This work has been submitted to the IEEE for possible publication
KEMO: Event-Driven Keyframe Memory for Long-Horizon Robot Manipulation with VLA Policies
Long-horizon robot manipulation remains challenging because similar observations may occur at different execution stages, while the appropriate action depends on previously completed operations. Memory can address this ambiguity by enabling policies to infer task progress from execution history. However, existing memory-augmented approaches often either retain dense histories that require compression or rely primarily on recent context that may discard earlier task-relevant events. In this work, we propose propose KEMO, a lightweight plug-in memory framework that automatically selectively preserves keyframes associated with task-relevant state changes for VLA policies. KEMO combines robot kinematics with visual filtering to detect events, encodes the selected keyframes as compact temporally ordered memory tokens, and integrates them with current visual features through cross-attention and gated residual fusion for VLA training. The detected events also define higher-weight training samples near critical transitions. We evaluate KEMO on various real-world dual-arm manipulation tasks spanning 2 to 6 scored subtasks, and trajectory length ranging from 830 steps to 2846 execution steps (durations from 28 to 95 seconds). Compared with the memory-free baseline (e.g., $π_{0.5}$), KEMO improves aggregate Task Success Rate by 23.6\% and Stage Completion Rate by 34.1\%. Ablations show that event-driven keyframe selection outperforms uniform sampling and recent-frame retention, while the proposed gated fusion and keyframe-aligned loss weighting provide complementary gains.
A Generative Model for Closed-Loop Microsimulation of Signalized Intersections
Traffic microsimulators rely on hand-crafted behavior models that reproduce aggregate flow but miss the heterogeneous interactions between vehicles at signalized intersections. Learned trajectory predictors capture richer interactions but are short-horizon and tend to be unstable when run in closed loop. We present Enactor, an actor-centric generative model for closed-loop intersection microsimulation. The model focuses on vehicles; pedestrians are included as context that can influence vehicle decisions but not predicted. Dynamic actors and lane polylines are encoded in polar coordinates referenced to the intersection center. A transformer with separate spatial and temporal attention blocks predicts a distribution over each actor's next-step motion ($s$, $α$). Training uses a closed-loop curriculum so the model is exposed to its own predictions. We evaluate Enactor in two regimes. In a 4000-second simulation-in-the-loop test at two intersection geometries, Enactor controls every dynamic vehicle against a continuously refreshing actor set rather than the fixed cohort that learned trajectory predictors are usually evaluated against. It recovers the SUMO data generator's speed and travel-time distributions with KL divergence over an order of magnitude lower than a recent transformer baseline on travel time, and substantially lower on speed (roughly $5\times$ lower at Site 1), and reduces red-light violations relative to the same baseline by more than an order of magnitude. An ablation isolates the leader rear-bumper feature as the change with the largest effect on intersection-aware safety metrics. We also evaluate on real-world field data and apply the same architecture to naturalistic vehicle trajectories from a fish-eye camera at a signalized intersection and evaluate it on multi-horizon predictive tasks. Enactor outperforms a constant-velocity baseline at every horizon evaluated.
Decentralized Autonomous Traffic Management through Corridor Networks
As autonomous aircraft are introduced at scale and traffic density increases, centralized management becomes insufficient to coordinate the large numbers of crewed and uncrewed aircraft. Dedicated Advanced Air Mobility (AAM) corridors have therefore been proposed for organizing high-density autonomous traffic flows. The desire to scalably provide autonomous aircraft flexibility in trajectory planning motivates the development of decentralized approaches to traffic management in AAM corridors. In this work, we extend a multi-agent reinforcement learning (MARL) approach to address the challenge of decentralized traffic flow management in air corridor networks. We test policies trained in a single-corridor setting on increasingly complex multi-corridor networks with combinations of merges and splits in a zero-shot manner. Experimental results demonstrate that learned behaviors transfer well to scenarios with varying traffic density, network geometry, and heterogeneous vehicle performance, without needing centralized coordination or model retraining. We evaluate system-level performance in terms of conformance to corridor boundaries, completion rates, average speeds, distance traveled, and maintenance of inter-aircraft separation. We find that although our policies require only locally coordinated entry, traversal, and exit behaviors, they collectively produce desirable traffic flows through the corridor network.
comment: Presented at the Second US-Europe Air Transportation Research and Development Symposium (ATRDS2026)
A Watermark for Vision-Language-Action and World Action Models
Vision-language-action (VLA) models and world-action models (WAM) are the generative models now driving general-purpose robot control, turning raw camera input directly into motor commands. They are increasingly deployed as black-box services, where a partner runs the policy through an interface while the owner keeps the weights private. Training such a model takes proprietary data and heavy computational power, making the deployed model itself a valuable intellectual property. To address this, we propose the \emph{keyed latent-provenance verification} method, which fingerprints the policy through the seed of the Gaussian noise vector that the models draw before generation. At the injection stage, the owner swaps this seed for a keyed one with the same distribution as ordinary noise, so the fingerprinted actions are statistically identical to those of an ordinary run and an adversary watching the output finds no signal to detect or remove. At the verification stage, the owner runs the suspect model under authorized access and records the action channels the robot executes, a partial and possibly post-processed view of the policy's output. From this view, the verifier recovers the seed by gradient-based maximum a posteriori (MAP) optimization, tests it for the secret key to score each rollout, and aggregates these scores into a single decision on whether the suspect model belongs to the owner. We evaluate the method on two representative models across two robot suites. The experiments cover detection of the fingerprint, identification of which of several keys a suspect carries, robustness to a range of attacks, and an analysis of why the design works. Across both models, the fingerprint can be detected reliably with little change to task performance, and it remains detectable under output-side removal attacks and weight-level edits.
HoloAgent-0: A Unified Embodied Agent Framework with 3D Spatial Memory
LLM agents follow a practical execution loop in digital environments: they reason over structured states, invoke tools, inspect feedback, and revise actions. Extending this loop to physical robots is difficult because physical execution is continuous, embodiment-dependent, uncertain, and constrained by safety. Existing embodied-AI systems have advanced manipulation, spatial understanding, navigation, and humanoid control, but these capabilities often remain specialized modules or loosely coupled decision loops. In this work, we introduce HoloAgent-0, a unified embodied agent framework for real-world robot deployment. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. HoloAgent-0 organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. We deploy HoloAgent-0 on real hardware and evaluate its spatial memory, long-horizon navigation, and closed-loop execution across motion generation, object search, cross-robot coordination, and mobile manipulation.
BiliVLA: Scene-Aware Vision-Language-Action Model with Reinforcement Learning for Autonomous Biliary Endoscopic Navigation
Endoscopic retrograde cholangiopancreatography (ERCP) demands precise endoscopic navigation and stable biliary cannulation within a narrow monocular field characterized by specular reflections, partial occlusions, and frequent tissue contact. Although recent robotic systems and vision-based assistance techniques improve operator ergonomics and provide perceptual cues, their performance degrades under pronounced anatomical variability and safety-critical visual artifacts, which hinders reliable autonomy in cannulation-grade procedures. Here, we present BiliVLA, a scene-aware Vision-Language-Action (VLA) framework that formulates biliary endoscopic navigation as an instruction-conditioned visuomotor learning problem. Given an endoscopic observation and a stage-specific language instruction, BiliVLA jointly predicts the target category, a grounded bounding box, and a discrete three degrees of freedom (DoF) motor command for a continuum endoscope. The proposed framework incorporates scene-aware supervision to enhance semantic target consistency and safety-aware recovery supervision to induce conservative retreat behaviors under luminal wall contact. A key component of BiliVLA is a two-stage training paradigm that combines grounding-enhanced supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), which significantly improves action reliability and decision consistency during closed-loop navigation. Across three ERCP subtasks, BiliVLA achieves an average action precision of 91.96\% and an overall success rate (SR) of 84.85\% in real-world phantom experiments. These results indicate that integrating semantic grounding, scene-aware learning, and reward-guided optimization improves perception-action alignment and enables robust autonomous endoscopic navigation.
DVL-DeepONet: A Physics-Guided Operator Learning for Resilient Underwater Navigation
Autonomous Underwater Vehicles (AUVs) rely heavily on the fusion of inertial sensors and Doppler velocity logs (DVLs) for navigation. In standard autonomous navigation systems, the DVL measures four beam velocities, thereby enabling the estimation of the AUV velocity vector. However, during real-world missions, the DVL may receive noisy or incomplete beam measurements due to marine obstacles, seabed reflections, or environmental disturbances. Furthermore, some low-cost underwater platforms operate without inertial sensors to reduce system complexity and cost. In such cases, reliable estimation of the AUV velocity vector in real-world missing beam scenarios becomes challenging, leading to degraded navigation solutions. To circumvent these challenges and enable resilient underwater navigation, we propose DVL-DeepONet, a physics-guided deep neural operator framework along with three variants. The proposed models are designed to estimate DVL-based velocity information under multiple operational scenarios, including (i) noise-resilient estimation in coupled inertial/DVL measurements, (ii) DVL-only learning, and (iii) beam measurement recovery. By learning a nonlinear operator that maps temporal inertial/DVL observations directly to vehicle velocity while enforcing DVL measurement physics through a consistency constraint, the proposed approach enables robust velocity estimation even under degraded sensing conditions. The proposed framework is validated using real-world AUV experiments, comprising a cumulative path length of approximately 10,000 m. Experimental results demonstrate that the proposed DVL-DeepONet architectures outperform baseline model-based approaches and learning-based algorithms by 40%.
comment: 15 pages, 6 figures
SkyJEPA: Learning Long-Horizon World Models for Zero-Shot Sim-to-Real Control of Quadrotors
Accurate dynamics models are critical for informed decision-making in robotic systems, particularly for agile aerial vehicles operating under uncertainty. Neural network dynamics models are attractive for capturing complex nonlinear effects, but existing predictive approaches struggle with long-horizon forecasting because their autoregressive rollout mechanism amplifies errors over time. Joint Embedding Predictive Architectures (JEPAs) offer a compelling alternative by modeling dynamics in latent space, yet prior JEPA-style methods for robot navigation have been studied primarily for kinematic-level planning, with limited investigation in high-frequency control. In this work, we introduce the JEPA-style model for real-time quadrotor control. The proposed approach combines a latent dynamics model with a novel physics-inspired prober that maps frozen latents to interpretable state, enabling physically grounded long-horizon prediction. Additionally, we combine the learned model with a sampling-based optimal control solution to take advantage of its predictive capabilities for real-time control on embedded hardware. Finally, to reduce the dependence on expensive and unsafe real-world data collection, we develop a structured pipeline for automated dataset generation. Extensive open-loop and outdoor closed-loop experiments demonstrate accurate prediction, robust zero-shot sim-to-real transfer, and strong generalization across diverse operating conditions.
comment: Under Review
DexTeleop-0: Force-Aware Bimanual Dexterous Teleoperation with Ego-Centric Perception towards Shared Autonomy
Fine-grained, bimanual dexterous manipulation remains a foundational challenge in robotics. Traditional teleoperation systems often fail in contact-rich tasks because embodiment gaps hinder accurate kinematic mapping, while tactile and force feedback remain absent. Consequently, data collection efficiency for high-precision tasks remains prohibitively low. To address these limitations, we propose a tactile-driven adaptation strategy designed to enable fine-grained manipulation on top of teleoperation pipelines. Instantiated within our bimanual dexterous framework, DexTeleop-0, this strategy introduces a real-time optimization loop that bridges the embodiment gap by translating coarse human tracking intents into precise, force-compliant robotic commands with tactile sensing. By estimating accurate contact points and leveraging a tactile-enabled fingertip force-sensing profile, the system dynamically computes localized corrections using the operational space Jacobian with respect to joint angle updates. We rigorously evaluate this tactile-driven adaptation strategy across both simulated environments and real-world hardware. Compared with representative baselines, the proposed method consistently achieves higher task success rates and improved execution efficiency in robust grasping, disturbance-resilient manipulation, and complex dexterous tasks.
comment: 15 pages, 6 figures, 5 tables
Flowing With Purpose: Latent Action Guided Flow Matching Policies For Robotic Manipulation
Flow matching has recently become a new standard for behavior cloning in robotic manipulation. However, state-of-the-art flow matching policies suffer from a systematic structural mismatch: they rely on a globally fixed isotropic source distribution despite the strongly fragmented and heteroscedastic structure of robotic action spaces. This agnostic initialization forces the model to learn highly entangled vector fields, bottlenecking training efficiency and limiting overall policy performance. To address this limitation, we introduce Latent Action Guided Flow Matching (LAFM), a novel framework that replaces the monolithic Gaussian with an adaptive library of learned prior distributions. By grounding these distributions using a latent action model, LAFM maps current observations to discrete motion primitives, selecting a specialized base distribution that provides an informed, structurally aligned initialization for the denoising process. This dynamic adaptivity naturally accommodates heteroscedasticity in human demonstrations and makes transport trajectories shorter and less entangled. Empirically, LAFM substantially outperforms standard flow matching formulations, increasing task success rates by 23.4% in real-world robotic deployments and by 10.4% on the LIBERO-90 benchmark. Furthermore, we demonstrate that LAFM achieves state-of-the-art results, surpassing massively pre-trained vision-language-action models while utilizing significantly smaller architectures.
TSD: A Physics-Inspired Trajectory Saliency Detector for Efficient Imitation Learning
For imitation learning in robotic manipulation, high data collection costs result in the scarcity of high quality data. In this paper, we leverage the inherent heterogeneity of trajectories to address this challenge. Based on our observations of manipulation tasks, we categorize motions into transitional, precise, and agile types, defining the latter two as trajectory saliency due to their criticality to task success in contrast to the prevalent but less relevant transitional motions. Therefore, we propose the Trajectory Saliency Detector (TSD), a training-free and plug-and-play framework to identify trajectory saliency. TSD employs two physically-grounded metrics: spatial entropy to capture fine-grained manipulation and centripetal acceleration to detect agile maneuvering. We further leverage TSD to develop a dataset compression method that reduces training costs and a dataset expansion strategy that improves data collection efficiency. Extensive experiments in both simulation and real-world settings demonstrate that models trained on TSD-condensed datasets achieve comparable or even superior performance with 25% less data on average. These results validate the effectiveness of our dataset compression and expansion strategies, thereby confirming the utility of TSD. Consequently, TSD offers a scalable and cost-effective pathway to synthesize information-dense datasets for efficient robot learning. Project page: https://trajectory-saliency-detector.github.io/trajectory-saliency-detector/
A Relaxed Quadratic-Program-based Framework for Trajectory Tracking of Unicycle Robots with Singularity Avoidance
Dynamic feedback linearization (DFL) is a classical technique for trajectory tracking of unicycle-type mobile robots, but the resulting DFL-based controller becomes singular when the linear velocity vanishes, rendering standard DFL-based controllers unsuitable for stop-and-reverse maneuvers. This paper proposes a quadratic-program (QP)-based optimal control framework that avoids this singularity, while establishing local Lipschitz continuity of the resulting feedback law. Our approach reformulates the DFL constraints as an equality-constrained QP with a slack variable, ensuring feasibility for all states and reference signals, including at points where the robot's velocity vanishes. By introducing slack variables and tunable parameters, we demonstrate that the singular configuration can be avoided for a large class of reference trajectories. The effectiveness of the proposed approach for trajectory tracking is demonstrated through ROS 2-Gazebo simulations on a TurtleBot3 Waffle robot. The code is available at https://gradslab.github.io/DFL_QP_Unicycle/
comment: 6 pages, 4 figures, paper accepted at Conference of Control Technology and Applications (CCTA) 2026
When Robots Rate Their Own Interactions: Engagement Validity and the Strangeness Failure
Human-robot interaction (HRI) evaluation relies almost exclusively on human-completed questionnaires, leaving the robot's perspective unexamined. We propose an \textit{inverted evaluation}, in which LLM-powered robots complete the same standardized instruments from their own perspective, and test whether these ratings agree with human ground truth. In Study~1, five LLMs completed HRI-CUES, Godspeed, and RoSAS questionnaires for 25~interactions ($N = 1{,}522$ evaluations) from the HRI-CUES dataset. LLMs achieved moderate-to-strong agreement on engagement dimensions (satisfaction $r$ up to $.65$ and enjoyment $r$ up to $.72$) with excellent test-retest reliability (ICC $\geq .82$), but \textit{systematically inverted} the comfort/strangeness dimension ($r = -.44$ to $-.67$, all $p < .05$), conflating engagement with comfort. In Study~2, a Nao robot running Claude~Sonnet~4.5 replicated these patterns in live interactions ($N = 4$), including real-time turn-by-turn assessment. The strangeness failure persisted across five models, synthetic controls, and embodied deployment for two participants. We argue that current LLM-based robots lack access to the internal affective states needed to assess constructs like strangeness, and that inverted evaluation requires supplementary modalities (e.g., physiological signals, gaze, proxemics) to move beyond behavioral proxies. These findings establish boundary conditions for using LLMs as interaction evaluators in HRI.
From Pixels to Concepts: Growing Rich 3D Semantic Scene Graph Forests utilizing Foundation Models IROS 2026
Operating in complex real-world environments requires robots to understand their surroundings on a functional semantic level. This demands a detailed multi-layer world model capturing the complex relations of its surroundings. Hierarchical 3D scene graphs address this challenge by integrating geometric, semantic, and relational data within a unified spatial framework. However, current 3D scene graph approaches often restrict themselves to rigid structures of pre-determined relationship classes, mostly neglecting important semantic connections, like causal connections or environmental contexts. This paper explores the potential of foundation models to build forests of 3D scene graphs with open semantic relationships to improve scene understanding and robotic task execution. We propose a method where instance-specific concept-nodes and relationships are first identified by a VLM and extended upon by a LLM, inferring broader, more abstract concept-nodes and relationships through reasoning. These object-nodes, concept-nodes, and relationships are then assembled into a forest of hierarchical 3D scene graphs, enhanced with concept-nodes to represent abstract concepts. Evaluations were conducted on the uHumans2 and ScanNet indoor dataset, validating the accuracy and relevance of the generated relationships. Downstream suitability of scene-graph forests for robotics applications is demonstrated in an open-vocabulary object-retrieval task utilizing both ScanNet data and a real-world indoor deployment using a Boston Dynamics Spot. This paper leverages foundation models to create more expressive, semantically deep 3D hierarchical scene graphs and demonstrates their potential to advance semantic and environmental understanding in robotics.
comment: To be published in the Proceedings of the IEEE/RSJ International Conference on Intelligent Robots & Systems (IEEE IROS 2026)
IOI: Decoupling Kinematics and Physics for Interactive World Models
Developing generalist embodied agents requires interactive environments providing visually realistic feedback and accurate action-conditioned dynamics. Interactive world models address this by simulating such complex dynamics. However, purely data-driven methods struggle to ensure precise control alignment and physically plausible visual feedback due to a lack of explicit structural constraints. To address this, we propose IOI, a hybrid interactive world model integrating analytical kinematic priors with learned physical dynamics. Unlike data-driven approaches prone to spatiotemporal drift, IOI introduces explicit kinematic guidance, computing forward kinematics from action sequences for accurate motion trajectories. These trajectories are rendered into synchronized front, side, and top orthographic projections, eliminating the need for extrinsic camera calibration. A Multi-view Kinematic Aggregation and Injection module fuses these geometric cues and injects them into the video generator, providing geometry-consistent guidance. Conditioning video generation on these deterministic trajectories establishes a synergy between the analytical simulator and the world model. Decoupling deterministic motion into the kinematic prior frees the generator to model stochastic physical interactions. Experiments on the RoboTwin benchmark validate IOI across kinematic fidelity, out-of-distribution (OOD) generalization, and policy evaluation. IOI achieves state-of-the-art simulation performance and robust zero-shot generalization to unseen OOD tasks. Furthermore, IOI serves as a reliable policy evaluator, yielding success rates closely aligning with ground-truth physics simulators. On real-world platforms, policies trained on IOI-synthesized data match those trained on teleoperation demonstrations, solidifying its practical value for embodied policy learning.
Flow6D: Discrete-to-Continuous Flow Matching for Efficient and Accurate Category-Level 6D Pose Estimation
6D pose estimation is a key task in computer vision and embodied AI, widely used in robotic manipulation, augmented reality, etc. Existing methods directly regress in a high-dimensional continuous space, facing two key challenges in category-level pose estimation: limited accuracy due to noise and local optima, and inefficient search over an infinite space that hinders real-time performance. This paper proposes Flow6D, a hierarchical flow matching framework with a two-stage discrete latent space localization-continuous pose regression strategy. Rotation and translation parameters are first discretized into bins, with a discrete flow matching model locking the latent space around the true pose to reduce search complexity. Then, by sampling in the latent space, a continuous flow matching model predicts local pose residuals to optimize the estimate and regress to an accurate pose. The framework also naturally extends to articulated objects, outperforming state-of-the-art methods on synthetic and real datasets with real-time inference at 70 FPS. Project website: https://flow6d.github.io/.
comment: Accepted for publication in IEEE Robotics and Automation Letters (RA-L), 2026
Causal Reward World Models: Zero-shot Reward Design for Automated Skill Generation
Automated Reward Design (ARD) aims to replace manual reward engineering in reinforcement learning with language-driven reward function synthesis. However, existing approaches based on large language models (LLMs) remain inherently correlation-driven, relying on iterative environmental feedback to refine reward hypotheses for each specific task. This paradigm not only results in inefficient reasoning but also makes LLMs susceptible to semantically plausible yet causally spurious reward components, leading to ineffective optimization. To address these limitations, we propose the Causal Reward World Model (CRWM), which explicitly models the causal topological relationships between candidate reward components and task-targeted physical variables through offline pre-training on multi-task interaction data. Based on a coarse-to-fine pre-training strategy, we introduce a joint optimization module that integrates Explicit Mechanism Decoupling with Confidence-Aware Soft Fusion to refine coarse structural priors using micro-level trajectories, thereby constructing a robust and interpretable causal skeleton. During inference, LLMs leverage CRWM as a task-irrelevant causal prior to constrain the reward generation, enabling zero-shot reward function design. Our work opens up a new white-box paradigm for the ARD problem. Extensive experiments on complex continuous control benchmarks demonstrate that CRWM generates executable reward functions without feedback-driven reward refinement, significantly reducing the design latency for acquiring new robotic skills while matching or surpassing state-of-the-art performance, and further exhibits strong generalization capabilities across unseen tasks and diverse robotic embodiments.
comment: 22 pages, 18 figures
Conceptual Design of an Ecosystem for Real Farm Data Collection toward Agricultural AI Foundation Models
Data scarcity is a fundamental challenge in developing AI and foundation models for agricultural robots. Existing open-source data platforms do not provide sufficient incentives for data providers so long-term data collection remains difficult. Furthermore, advances in generative AI have introduced a new challenge of verifying that collected data genuinely originates from real farm environments. We propose an ecosystem for the sustainable collection and distribution of real farm data, integrating automatic pricing driven by demand and rarity, revenue sharing that distributes earnings to farmers as an incentive to keep providing data, and data authenticity guarantees through authenticated device uploads. To demonstrate the economic sustainability for all three parties among farmers, AI companies, and the platform, we estimate the economic value that agricultural robots stand to generate.
LP-NavOA: Integrated Local Navigation and Obstacle Avoidance for Humanoid Robots under Limited Perception
Humanoid local navigation in cluttered environments must jointly resolve obstacle avoidance, sparse-goal recovery, and stable whole-body locomotion under short-range and partially observable sensing. Explicit planner-control decompositions introduce latency and can mismatch agile humanoid command-tracking limits, while purely reactive controllers may lose the goal after obstacle occlusion. We present LP-NavOA, a limited-perception navigation and obstacle-avoidance framework for humanoid robots. A raycast-conditioned perception-action proximal policy optimization (PPO) locomotion backbone is first trained with a robot-centered circular heading-speed command and a shared command-side safety filter. With this backbone frozen, A-star and waypoint teachers generate rollouts for distilling a recurrent local planner that overwrites only the heading command at deployment, leaving the whole-body policy intact. At runtime, LP-NavOA uses proprioception, short-range local range sensing, and a body-frame goal direction, requiring no global map, waypoint stream, or external planner. In MuJoCo open-wall and indoor layouts, the distilled planner produces obstacle bypassing and post-avoidance goal recovery, raising teacher-calibrated on-time arrival from 38--40\% to 85--97\% and reducing brush/contact-heavy progress relative to a backbone-only controller. Ablations show that dynamic route shaping, teacher-active data collection, and the circular command interface are important for navigation efficiency and for training the 3.0\,m/s backbone. A Unitree G1 deployment analysis demonstrates hardware executability without continuous joystick steering.
Lessons from the Field: A Case Study of Robotic Intervention in an Industrial Emergency
Incidents in chemical plants can pose a high level of risk and harsh environments for first responders. Contamination and explosion hazards can deny human access to the affected infrastructure, underscoring the need for capable robot systems. This field report documents the successful deployment of a robotic task force to neutralize an explosive gas hazard at a chemical plant after a fire incident. An Unmanned Ground Vehicle (UGV) with a custom manipulation tool opened a critical valve under hazardous conditions, averting the threat of a large-scale explosion. We provide insights into robot deployment and use the mission results to highlight both the importance of rescue robotics and limitations of using research platforms in real emergency deployments, such as communication constraints and the need for enhanced operator-assistance functions.
comment: Accepted final version. IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Galway, Ireland, October 2025
Bridging Semantics and Kinematics: A Modular Framework for Zero-Shot Robotic Manipulation
This paper presents a modular training-free framework for zero-shot, language-guided robotic manipulation in semi-structured environments. The architecture bridges the gap between high-level reasoning and low-level kinematics by decomposing the vision-action pipeline into three stages: visual perception, semantic interpretation, and task execution. To overcome the spatial ambiguity and semantic hallucinations inherent in standard Vision-Language Models (VLMs), the perception module employs FastSAM and Set-of-Mark (SoM) prompting to dynamically generate grounded, alphanumeric visual anchors. The same foundation model then operates purely as a Large Language Model (LLM) to act as a semantic router, translating unconstrained human directives into verifiable, reconfigurable configurations. Finally, these configurations are dynamically parsed by a Task Orchestrator into MoveIt Task Constructor (MTC) to generate collision-free trajectories. The framework is evaluated across two zero-shot experimental setups: unconstrained open-world sequential manipulation and dense relational spatial reasoning, achieving a 62% end-to-end task success rate across both scenarios, demonstrating its capacity to reliably execute complex physical actions without domain-specific training or manual coordinate programming.
comment: Accepted to RO-MAN 2026
Asymmetric physics enables efficient learning in quadrupedal robot swarms
Animal collectives navigate cluttered environments through local coordination, yet robot swarms still struggle to reproduce this capability in the physical world. End-to-end learning offers a route to such coordination, but scaling it to embodied swarms remains difficult: standard sampling-based reinforcement learning becomes inefficient when visual perception, dense robot-robot interaction, and contact-rich locomotion must be learned together. Here we show that asymmetric physics enables efficient end-to-end learning of vision-based, decentralized control in large swarms of quadrupedal robots. During training, quadrupeds interact in shared environments, where a high-fidelity, non-differentiable simulator generates realistic motion and contact dynamics, and differentiable surrogate models provide gradients for navigation and locomotion policies. This separation enables up to 512 quadrupeds to learn coordinated navigation policies in obstacle-rich environments. At deployment, each robot acts from a single forward-facing depth camera, without explicit communication, centralized planning, or global maps. The policies generalize across forests, bridges, enclosures, narrow passages, and mazes, and zero-shot transfer to six physical quadrupeds across five real-world scenarios. The resulting swarms exhibit predictive avoidance, right-side yielding, pausing before bottlenecks, and wall following, showing that asymmetric physics enables efficient training of scalable decentralized control policies for quadrupedal robot swarms.
ShotcreteDepth: A Bi-modal Dataset for Robust Robotic Depth Perception in Shotcrete Construction Environments
We introduce ShotcreteDepth, a bi-modal dataset from the construction domain that captures both an active shotcreting process and general construction environments. The dataset comprises stereo RGB imagery and LiDAR point clouds acquired under harsh real-world conditions, including high turbidity and poor illumination. Such conditions adversely affect sensor measurements, leading to incomplete and noisy observations that pose significant challenges for perception systems in autonomous applications. Alongside the dataset, we release a lightweight annotation tool designed for time-efficient labeling of LiDAR point clouds. ShotcreteDepth consists of 11,252 temporally synchronized data samples, of which 220 are annotated for evaluation purposes. The dataset supports research in stereo matching, depth completion, and depth estimation under conditions that closely reflect the operational complexities found in industrial settings. Project repository: https://github.com/dtu-pas/shotcrete-depth
Assistron: Bayesian Shared Autonomy with Off-the-shelf Vision-Language-Action Models
We propose Assistron, a shared autonomy model that leverages Vision-Language-Action (VLA) models to assist the user in daily activities. Our approach is grounded in two core principles: (1)~minimizing human cognitive and physical effort by leveraging VLA-driven autonomy for macro-movements, and (2)~prioritizing human intervention specifically at critical failure points. Driven by the user's verbal language commands, Assistron utilizes the VLA to autonomously execute macro-reaching trajectories, saving users' effort. In contact-rich interactions where VLAs tend to fail, Assistron employs a phase-aware interaction detection mechanism and solicits the user to intervene, in turn adjusting the VLA's action generation via flow matching guidance. Critically, our formulation eliminates the need for VLA fine-tuning, protecting its broad behavioral priors from catastrophic forgetting and ensuring the model does not become a narrow specialist. We validate our approach on a comprehensive multi-task scene recovery benchmark encompassing diverse daily manipulation skills. Empirical results demonstrate that Assistron significantly improves task success rates over pure autonomous baselines while significantly reducing human cognitive and physical workload compared to traditional teleoperation, offering a scalable, smooth, and effortless paradigm for assistive manipulation. The code is available in https://github.com/mousecpn/Assistron.git.
comment: Using VLA in assistive robotics
Flow as Flow: Modeling Robot Velocity Fields as Probability Velocity Fields for Flow-Based Object Manipulation
Cross-embodiment data have become central to training robotic foundation models. To leverage such heterogeneous data, we focus on flow-based object manipulation, where robot flows (robot velocity fields) serve as embodiment-agnostic motion representations. Previous studies do not formulate robot flows as dense velocity fields, but as displacements of sparse keypoints, while such velocity fields better match the continuous-time nature of motions. We propose Flow as Flow, a framework that models robot flows as probability flows based on a flow matching formulation. By naturally modeling such velocity fields within this formulation, our method achieves efficient and high-quality robot flow generation. Across standard benchmarks, our method outperforms representative baseline methods on standard metrics, while achieving approximately 33$\times$ faster generation. Furthermore, through real-world experiments evaluating 9 methods with 260 trials per method across 13 manipulation tasks, we show that our method achieves a higher average success rate than the baseline methods. Our project page is available at https://flow-as-flow-u0n5y.kinsta.page.
Foresight: Failure Detection for Long-Horizon Robotic Manipulation with Action-Conditioned World Model Latents
Long-horizon tasks are common in real-world robotic deployments, yet failure detection for such tasks remains underexplored. Detecting failures in long-horizon robotic tasks is particularly challenging because failure onset is often ambiguous and dense temporal annotations are typically unavailable. We present Foresight, a failure detection framework that monitors manipulation trajectories using latent representations from an action-conditioned world model. Foresight is trained using only final task-level success or failure labels. By leveraging predictive world-model embeddings, our method provides a unified framework for failure detection across different policies. We further use functional conformal prediction (FCP) to calibrate detection thresholds adaptively. We evaluate Foresight with state-of-the-art vision-language-action policies in simulation on LIBERO-Long, ManiSkill-Long, and BEHAVIOR-1K, compare it against state-of-the-artfailure detection methods, and validate it on real robots with three long-horizon tasks on a ReactorX-200 arm and one task on a Franka arm. Our results suggest that action-conditioned world-model embeddings provide a scalable representation for reliable failure monitoring in long-horizon manipulation.
AdaReP:Adaptive Re-Planning under Model Mismatch for Neural World-Model Predictive Control ICANN 2026
Neural world models coupled with model predictive control (MPC) replan at every environment step to bound accumulated prediction error, but this incurs substantial computational overhead. Reusing a cached plan reduces this overhead, yet its effectiveness depends on how prediction mismatch propagates through the local dynamics. We analyze this trade-off with a perturbation-based dynamic-regret framework and show that stale-plan penalties scale with the reuse tolerance, the accumulated mismatch since the last replanning step, and the local dynamics sensitivity. Based on this structure, we propose AdaReP, a training-free wrapper that adapts the replanning tolerance online using the current deviation from the cached rollout and a local sensitivity estimate, without modifying the learned world model or planner. Across image-space planning, latent-space control, and real-world robotic manipulation, AdaReP substantially reduces planner-side computation while maintaining comparable task performance, including over 80% fewer queries on a 50-trial physical robot study.
comment: Accepted at ICANN 2026. This arXiv version contains supplementary materials and appendices that are omitted from the conference version due to space limitations
ISOPoT: Imaging Sonar Odometry by Point Tracking
Reliable navigation in underwater environments remains a key challenge in marine robotics. In such scenarios, forward-looking sonars are a natural choice for long-range perception, offering wide coverage even in turbid, low-visibility conditions. However, sonar images are inherently noisy, contain artifacts, and lack rich semantic structure, causing standard computer vision methods for keypoint detection and matching to perform poorly. In this paper, we introduce ISOPoT, an imaging sonar odometry method based on modern point tracking techniques. We propose a sonar odometry pipeline that uses multi-frame point tracks as its primary correspondence representation, augmented with lightweight optimizations to improve robustness. We evaluated the proposed method on the Aracati 2017 dataset, as well as on an internal sonar dataset collected in real-world underwater environments. Our results show that ISOPoT outperforms previous state-of-the-art methods consistently in both sonar-only scenarios and in multi-sensor settings.
TEXEDO : Test Time Scaling for Controller-aware Language-conditioned Humanoid Motion Generation
Text-conditioned motion generation is a promising interface for programming humanoid robots, yet current generators are often trained on human motion datasets retargeted to robot morphologies. Although such data provides rich semantic and kinematic priors, it fails to capture the nuances of whole-body tracking controllers, including balance, contact dynamics, actuation limits, and controller-specific failure modes. As a result, generated motions can be semantically plausible but difficult or impossible for the robot to execute. We introduce TEXEDO, a test-time scaling framework for humanoid motion generation that improves motion quality without requiring a stronger underlying generator. Given a text prompt, TEXEDO samples multiple candidate motions from a pretrained text-conditioned generator and selects the best motion that is both executable and task-aligned. The reward model combines a dynamic feasibility verifier, distilled from whole-body tracking rollouts to predict physical executability, with a semantic alignment verifier that measures text-motion alignment in a learned co-embedding space. Our pipeline treats dynamic feasibility as a hard constraint and semantic alignment as the selection objective within the feasible set. Through large-scale simulation studies and real-world deployment on a Unitree G1 humanoid robot, we show that TEXEDO consistently improves both tracking fidelity and text alignment. These results demonstrate that grounded verification is an effective path toward deployable language-guided humanoid motion generation. Project website: https://jianuocao.github.io/TEXEDO/
Can Single-View Mesh Reconstruction Generalize to Robot Camera Rotation?
Single-view mesh reconstruction predicts object meshes and spatial layouts from a single observation, making it attractive for fast robot spatial reasoning and real-to-sim digital twins. However, robot-mounted cameras naturally rotate during manipulation and navigation, while learned single-view reconstruction models often rely on view-dependent priors and may generalize poorly to out-of-distribution camera rotations. Such rotations can introduce 3D inconsistencies, incorrect layouts, and violations of physical constraints, but this failure mode remains under-evaluated. We introduce an evaluation protocol with controlled axis-wise roll, pitch, and yaw sweeps to trace errors in monocular depth estimation (MDE), canonical object meshes, camera-space layout, and physical plausibility within a representative SAM3D-style pipeline. On the Aria Digital Twin dataset and a real Franka wrist-camera sequence, camera rotations induce MDE distortion, layout drift, and collision penetration, while canonical mesh predictions remain relatively stable. A two-stage SAM3D+FoundationPose pipeline is more robust than one-stage feed-forward layout prediction, and our Gravity-Aware Refinement reduces one-stage pairwise ICP-based layout-orientation error by 47.1$\%$. Our evaluation reveals that current single-view mesh reconstruction methods generalize poorly to robot camera rotation, and suggests that explicit gravity cues are important for reliable robotic single-view mesh reconstruction.
Distilling Collaborative Dynamics into Latent Space for Implicit Coordination in Decentralized Multi-Agent Manipulation IROS 2026
Multi-arm manipulation demands precise spatiotemporal coordination, yet many centralized approaches scale poorly as team size increases. To address this, we propose CLS-DP, a decentralized multi-agent framework that enables implicit coordination under partial observability without shared global views, explicit state information, or inter-agent communication. Under the centralized training and decentralized execution (CTDE) paradigm, CLS-DP distills privileged multi-agent dynamics into a latent space. At deployment, each agent infers a collaborative latent from its local RGB observation and a shared task instruction; it then conditions the diffusion denoising process on this latent. This design enables implicit coordination with a per-agent cost independent of team size. Across six RoboFactory benchmark tasks spanning two to four agents, CLS-DP achieves a 38% mean success rate, outperforming the best centralized baseline (20%) and a decentralized ablation without the collaborative latent (9%). It also maintains superior parameter efficiency across all agent configurations. Attribution maps show that an agent conditioned on the collaborative latent places high attribution on the joints and grippers of both itself and its teammates throughout execution. This suggests that the learned latent efficiently encodes collaborative dynamics from local observation, which facilitates implicit coordination in realistic settings characterized by partial observability.
comment: Accepted to IROS 2026 | Project Page: https://cosdeneb.github.io/cls-dp/
Humanoid-OmniOcc: Stereo-Based Full-View Occupancy Dataset for Embodied AI
Occupancy prediction at voxel-level granularity is essential for safe robotic navigation and interaction in complex environments. Existing occupancy datasets, however, are predominantly designed for autonomous driving with vehicle-centric biases -- forward-facing cameras, far-field geometry, and static road priors -- limiting their applicability to embodied humanoid perception. We present Humanoid-OmniOcc, a large-scale panoramic stereo-based occupancy dataset tailored for humanoid robots. The dataset encompasses 15 diverse simulated indoor scenes and 5 real-world environments, yielding over 155K samples with broad scene and style diversity. Importantly, the dataset is designed around a Real2Sim2Real closed-loop paradigm: real sensor specifications drive physically accurate simulation, simulation produces large-scale annotated training data, and models trained in simulation are directly evaluated on real-world captures -- enabling iterative refinement of the sim-to-real pipeline. We further propose \textbf{H}umanoid \textbf{S}urround \textbf{S}tereo-guided \textbf{Occ}upancy model (Humanoid-OmniOcc) that exploits robust depth priors for accurate 2D-to-3D lifting. Extensive experiments show that Humanoid-OmniOcc consistently outperforms monocular baselines and generalizes well to both unseen simulated test scenes and real-world environments, validating the effectiveness of the Real2Sim2Real design. Code and data will be available upon acceptance at https://d-robotics-ai-lab.github.io/humanoid-omniocc.
PanoVine: Whole-Body Visuomotor Control for Soft Growing Vine Robot
Vine robots, a class of soft, growing robots, are suitable for navigating complex and confined environments due to their compliant bodies and self-supporting growth mechanism. However, hysteresis, tether interactions, and deformations make them difficult to predict and model, which in turn limits the effectiveness of conventional planning and control approaches. In this work, we present a data-driven, vision-based control framework for the first autonomous vine robot system. Our system integrates 19 cameras distributed along the robot's body to provide comprehensive feedback of both the robot state and the surrounding environment. Using this rich whole-body vision feedback, we train an end-to-end visuomotor policy from demonstrations for closed-loop autonomous control in complex environments. The policy efficiently aggregates information from distributed sensing while maintaining robustness to inaccurate robot states and actuation. Experimental results demonstrate that the learned policy enables robust navigation and manipulation in challenging scenarios, including steering through branched structures, climbing up slopes, traversing unsupported terrain, reaching objects precisely, and maneuvering through confined spaces and obstacles. Project website https://panovine-bot.github.io
Improving Robotic Imitation Learning via Trajectory Standardization
Imitation learning for robotic manipulation relies on large sets of human demonstration trajectories, which are often noisy and temporally irregular due to variable operator speed, intermittent pauses, and inconsistent action density. A common preprocessing strategy is time-uniform downsampling to shorten sequences, but it cannot effectively remove speed-induced non-uniformity or redundant pauses. This mismatch degrades data quality and hinders policy learning. To address this issue, we propose Information-Standardized Trajectory Resampling (ISR), an offline preprocessing method for effective imitation learning. ISR resamples each trajectory by enforcing approximately equal information distance between adjacent points. Specifically, we map trajectories onto an information-modulated Riemannian manifold and perform geodesic-equidistant parameterization. We construct an information-intensity field from velocity and acceleration norms: the velocity term removes small-motion redundancy, while the acceleration term preserves high-curvature and fine-manipulation phases. We evaluate ISR on three real-world manipulation tasks with mainstream imitation learning policies. Compared with the baseline time-uniform 3x downsampling, ISR improves task success rates by about 25%, remains robust across datasets collected from different operators, and reduces both dataset size and training cost. The code and videos are publicly available at https://d-robotics-ai-lab.github.io/isr.page.
A Vendor-Agnostic LiDAR Data Conversion System with Multi-Signal Detection and Multi-Format Output
LiDAR (Light Detection and Ranging) sensors capture the surrounding environment as dense 3D point clouds by measuring the time-of-flight of emitted laser pulses, making them foundational across autonomous vehicles, robotics, and large-scale mapping. PCAP (Packet Capture) files from these sensors are the starting point of most 3D perception pipelines, yet internal packet structures, UDP (User Datagram Protocol) port conventions and encoding schemes differ enough across manufacturers that no single tool reads them all. Ouster, Velodyne, Hesai, and Livox each require their own SDK (Software Development Kit), their own environment setup, and their own conversion workflow. Supporting all four means maintaining four disconnected pipelines with no shared infrastructure. The pipeline described here takes a raw PCAP as input and handles vendor identification automatically, scoring six independent file characteristics through a weighted multi-signal approach to determine the source sensor. C++ SDKs handle Ouster and Velodyne, while Hesai and Livox rely on Python-based dpkt parsing where no open source SDK exists. From there, a single command writes output to any of five industry-standard formats. We tested on real outdoor captures. Ouster peaks at 2.08M points per second, Velodyne at 1.47M, both running through native C++ packet decoding. Hesai and Livox land at 110K and 150K respectively, where Python-layer parsing introduces overhead that compounds under sustained load. The 8-10x gap held consistently across runs. Tested on a consumer-grade i3 with 8GB RAM, no vendor configuration required
comment: Manuscript under review at Expert Systems with Applications (Elsevier)
HiL-ResRL: A Model-Agnostic Finetuning Adapter via Human-in-the-loop Residual Reinforcement Learning
Recent advancements in generative imitation learning have significantly propelled the field of robotic manipulation. However, the majority of existing models rely heavily on Behavior Cloning (BC), a paradigm that suffers from compounding errors and distributional shift. Consequently, the efficacy of these models in practical industrial deployments remains limited. To address these challenges, we introduce a novel, plug-and-play fine-tuning pipeline designed to facilitate the robust deployment of Vision-Language-Action (VLA) models in real-world environments. In contrast to contemporary reinforcement learning (RL) fine-tuning strategies, which are often constrained by specific model architectures, our proposed framework is model-agnostic and adaptable to a diverse range of VLA models. We conceptualize VLA-generated actions as a unified interface, upon which we train a residual policy. This policy is designed to rectify suboptimal actions and address the distributional shift inherent in imitation learning. Additionally, we incorporate human-in-the-loop guidance to ensure safe exploration and maximize training efficiency. We conduct experiments directly in real-world robotic settings. The results demonstrate that within only 1.5 hour of real-world online RL training, the average success rate exceeds 95% on real robots. Our work presents a practical solution for deploying behavior cloning models in industrial scenarios.
comment: 8 pages, 9 figures
FPAS: Frontier-Based Path Planning with Adaptive Sampling for Large-Scale Unknown Environments IROS 2026
In this work, we propose Frontier-based Path Planning with Adaptive Sampling (FPAS), a novel framework designed for efficient goal-reaching in large-scale, unknown environments. While existing planners often struggle with computational bottlenecks or inefficient paths during long-range navigation, FPAS overcomes these challenges by reinterpreting the frontier concept for goal-directed tasks. Specifically, our method leverages frontiers to effectively guide forward progression into unobserved regions and to select promising subgoals for backtracking from dead-ends or inefficient paths. Furthermore, FPAS introduces an adaptive sampling mechanism based on a frontier-derived openness metric. This mechanism dynamically adjusts the global graph's density by employing sparse nodes in open areas to alleviate computational burdens, while preserving denser sampling in narrow passages to ensure connectivity. Extensive evaluations demonstrate that FPAS substantially improves computational efficiency over baseline methods while maintaining highly competitive goal-reaching performance.
comment: IROS 2026
Cloak: Zero-Shot Cross-Embodiment Manipulation by Masking the End-Effector from the VLA
We present Cloak, a training recipe that endows a Vision-Language-Action (VLA) model with zero-shot cross-embodiment transfer by cloaking the end-effector from its own wrist camera. The end-effector occupies a large and consistent region of the wrist view and masking it allows for embodiment-agnostic visual reasoning. Cloak renders a mask in simulation from the robot's known geometry, accurately and in real time, with no segmentation or generative models. During training, we augment the mask so the model generalizes to embodiments unseen at training time. We demonstrate the recipe with Cloak-VLA, a VLA trained with Cloak on a single parallel-jaw gripper dataset. No data of new embodiments is ever collected. Cloak-VLA transfers zero-shot to various unseen embodiments, including another gripper, another arm, and a five-fingered hand, while preserving the source embodiment's performance. By decoupling the wrist view from its own embodiment, Cloak allows data to outlive the hardware it was collected on.
UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models
Mainstream Fast-Slow dual system vision-language-action models decouple a high-frequency action expert from a low-frequency vision-language model for efficiency, yet they face a fundamental frequency dilemma: large update gaps cause semantic drift from stale context, while small gaps erode the intended computational savings. Moreover, because the action expert receives only the VLM's final-layer representation at a single fixed frequency, rich intermediate features are discarded, limiting both information coupling and manipulation precision. Inspired by multi-timescale neural processing in the human brain, we introduce UniFS, a unified fast-to-slow architecture that resolves these challenges through three key designs. First, we stratify the VLM layers into groups with progressively decreasing update frequencies, enabling shallow layers to capture fast-changing dynamics while deeper layers cache stable semantic context. Second, a latent vector inversion mechanism re-routes the interaction order between multi-scale VLM features and the action expert, aligning fast-varying representations with fine-grained action decoding and slow-varying ones with coarse planning. Third, a multi-level supervision strategy enforces a coarse-to-fine learning hierarchy across temporal scales. Together, these designs enable richer cross-frequency information transfer within a single backbone, while the low-frequency pathways additionally preserve temporal context across steps. Experiments on LIBERO show that UniFS achieves state-of-the-art performance (98.3\% average success rate, a 2.5\% gain over VLA-Adapter baseline) while reducing average inference latency from 36.5~ms to 17.8~ms (2.1$\times$ speedup). Real-robot experiments on a Franka platform further validate its practical applicability. Code is opensourced at https://github.com/linsun449/UniFS.
comment: Code is opensourced at https://github.com/linsun449/UniFS
Cooperative-ORCA*: Real-Time Proactive Deadlock Avoidance for Continuous-Space Multi-Agent Navigation
Multi-Agent Path Finding (MAPF) is a problem that requires computing collision-free paths for a set of agents from their start locations to designated goal locations. The problem has broad applications in domains where teams of robots must operate in a coordinated manner. ORCA* is a real time MAPF solver that assigns for each timestep a velocity for each agent. Due to its real time nature, it is myopic to future deadlocks that result from current decisions. ORCA*-MAPF attempts to remedy this limitation by introducing fallback mechanisms when deadlocks are detected. However, post hoc interventions often introduce significant flowtime overhead. In this paper, we introduce C-ORCA* and C-ORCA*-MAPF, continuous space MAPF algorithms that incorporate agents' entire spatial trajectory and their spatial dependencies to proactively prevent deadlocks from occurring, thus avoiding the high flowtime overhead associated with post hoc corrections in ORCA*-MAPF. The C-ORCA* family of algorithms significantly outperform previous state-of-the-art in terms of solve rate, runtime, and flowtime.
HERCULES: An Open-Source Simulation Framework for Heterogeneous Multi-Robot SLAM, Collaborative Perception, and Exploration
We present HERCULES, an open-source simulator and data-collection pipeline for heterogeneous multi-robot autonomy. Built upon the Unreal Engine 5 (UE5)-based simulators AirSim and Cosys-AirSim, HERCULES resolves key architectural limitations of prior frameworks to enable concurrent unmanned aerial and ground vehicle (UAV-UGV) operation in large-scale, photorealistic, dynamic environments. It introduces a new waypoint-tracking UGV controller that mirrors existing UAV control interfaces, and provides a shared navigation stack for mapping, traversability analysis, planning, and control across heterogeneous platforms. Expanding inherited sensor suites, it adds physics-based long-wave infrared (LWIR) cameras and configurable night-vision modes for degraded visual environments. HERCULES provides lightweight APIs, ROS 2 wrappers, and rigorous time synchronization across sensors and platforms, and brings state-of-the-art game-engine capabilities into robotics simulation, integrating intelligent agents such as pedestrians, traffic, and wildlife with high-fidelity dynamic phenomena, including fire, flooding, and crop disease spread. HERCULES runs in two modes: passively, replaying offline-designed trajectories to generate reproducible multi-modal datasets, and actively, running an online planner in closed loop from live observations. Our experiments in heterogeneous multi-robot SLAM, collaborative perception, and exploration, using both HERCULES-generated data and active closed-loop execution, demonstrate its utility for advancing heterogeneous multi-robot autonomy. We publicly release our source code, experiment code, documentation, and datasets, including a heterogeneous multi-robot SLAM benchmark collected with two UAVs and two UGVs across kilometer-scale desert, forest, and city environments, at https://lunarlab-gatech.github.io/HERCULES-website.
comment: 19 pages, 14 figures, and 12 tables
Temporal Logic Guidance for Action-Only Diffusion Policies with World Models ICRA 2026
Diffusion policies enable multimodal robot behavior but offer limited ability to choose among behavior modes at inference time, even though such control is desirable in human-robot settings. Prior solutions to this lack of control have utilized Signal Temporal Logic (STL) to express human intentions and provide corresponding guidance for diffusion policy inference. However, these approaches can only guide diffusion policies that jointly generate future actions and states, increasing both complexity and runtime. We propose a novel guidance method for action-only diffusion policies that uses a separate learned world model to enable differentiable evaluation of STL robustness, with its gradient then injected into the diffusion process. This steers behavior toward constraint satisfaction without retraining, improving constraint adherence while preserving task performance. On the Can Transport task from Robomimic, our method maintains 100% task success while reducing constraint violations from over 80% for baseline methods to 4%. We also discuss extensions toward improved robustness and more complex constraints.
comment: Accepted at the ICRA 2026 Workshop on Bridging the Gap between Robot Learning and Human-Robot Interaction. 3 pages, 2 figures, 1 table
Critique of Agent Model
What is an agent? What constitutes agency? With the rise of Large Language Model (LLM) systems marketed as ``coding agents'', ``AI co-scientists'', and other ``agentic" tools that promise to drive up productivity, and at the same time, ``existential" concerns such as AI escaping human control with destructive power under a speculative ``machine agency" against humans, it has become essential to clarify where automation ends and agency begins, both for building capable systems and for understanding whether and what to fear. Drawing on Descartes' grounding of agency in independent thought, and on portrayals of autonomous beings in science fiction, we survey the current landscape of AI agents, and analyze agent architectures along five dimensions: goal, identity, decision-making, self-regulation, and learning. Specifically, we argue that genuine agency requires these structures to be \emph{internalized within the system itself} rather than assembled through external scaffolding. This distinction between \emph{agentic} systems, whose competence resides in engineered workflows, and \emph{agentive} systems, whose capabilities (including social interaction) arise endogenously, defines the boundary between systems designed for prescribed tasks, and those capable of operating in the open world with true autonomy. Building on this analysis, we propose the Goal-Identity-Configurator (GIC) architecture for a general-purpose agent model, combining hierarchical goal decomposition, identity evolution, simulative reasoning grounded in a separately trained world model, learned self-regulation, and self-directed learning from both real and simulated experience. Furthermore, we share insight on the auditability, controllability, and safety of agentive systems that possess greater autonomy and ``agency", but remain under human oversight.
Topological Online Learning for Displacement-based Formation Control
This paper addresses the problem of robust formation control by introducing Topological Online Learning for Displacement-based (TOLD) formation control, a real-time edge-level adaptation framework. Unlike conventional node-level robust controllers that regulate individual robot inputs without modifying the interaction topology, TOLD updates the interaction topology weights online to directly minimize formation distortion. Two strategies are proposed under the TOLD formation control framework: Online Gradient Flow (OGF) with unconstrained weights and Online Exponential Gradient Flow (OExpGF) with non-negative convex weights. Theoretical analysis establishes that, for single-integrator agents over directed graphs, OExpGF guarantees asymptotic consensus, while OGF ensures bounded formation distortion. Simulations with twelve robots under intermittent disturbances show 1.2%-33.14% median cumulative Root Mean Distortion Error reduction when augmenting TOLD with node-level controllers. Hardware experiments with Crazyflie 2.0 quadrotors demonstrate over 62% (OGF) and 31.4% (OExpGF) reduction in median formation distortion compared to fixed-weight consensus.
Enforcing Human-like Kinematics in Dexterous Piano Playing via Adversarial Posture Regularization
Reinforcement learning can train bimanual dexterous hands to play piano in physics simulation with high note accuracy, but for high-DoF dexterous hands, relying solely on task rewards or IK inversion often leads to unnatural postures and joint overextension. We propose \textit{Adversarial Posture Regularization (APR)}. It avoids expensive, song-aligned expert demonstration data and instead uses a small amount of casual human playing data. By matching the distribution of the posture of the policy with the human prior through an adversarial objective, APR encourages more human-like hand shapes. Meanwhile, we collect and release unstructured hand motion data of piano playing using a consumer-grade Meta Quest 3, and retarget the key motion information to the Shadow Hand. Finally, we achieve significantly better performance than prior methods on all three human-likeness metrics (cPSI, BSE, and FAC) as well as in visual quality. Project repository: https://github.com/APRProject/APRPianist.
Decentralized Coordination of Autonomous Traffic Through Advanced Air Mobility Corridors
The use of dedicated corridors for Advanced Air Mobility (AAM) traffic is one of the most commonly proposed pathways to integrating them into existing airspace operations. Most prior research has focused on the design of networks of AAM corridors and conflict resolution for aircraft within corridors. It is also generally believed that while attractive from an implementation perspective, corridor-based operations may be inefficient, especially in the absence of centralized traffic management. In this paper, we show that contrary to this belief, it is possible for autonomous aircraft to learn to self-organize into corridor flows in decentralized settings. We illustrate our approach using scenarios in which fixed-wing aircraft need to safely and efficiently traverse (1) a single corridor with metering after the exit, (2) a sequence of two consecutive corridors, and (3) a corridor that splits into two. We find that in decentralized settings with only local information, the aircraft are able to conform to the corridor boundaries more than 94% of the time and reach their goal in a relatively efficient manner. Furthermore, tactical interventions to handle violations of the separation minimum are needed only infrequently in low- and medium-density settings. However, such tactical interventions become more frequently necessary only when traffic density is high.
comment: Presented at the AIAA SciTech 2026 Forum
Engineering Reliable Autonomous Systems: Challenges and Solutions
Engineering reliable autonomous systems is an important and growing topic in computer science. As autonomous systems become more prevalent, easy-to-use techniques for building them reliably are increasingly important. This workshop report captures and expands on the discussions at the Lorentz Center Workshop "Engineering Reliable Autonomous Systems" (ERAS), held from 10 to 14 June 2024. The workshop was co-organised by the organisers of the Workshop on Formal Methods for Autonomous Systems (FMAS) and the Workshop on Agents and Robots for reliable Engineered Autonomy (AREA). It brought together members of the FMAS and AREA communities, industry practitioners, and representatives from sectors where autonomous systems pose distinctive engineering challenges. The workshop focused on three main research topics: techniques for verification and validation of autonomous systems; engineering real-world autonomous systems; and software architectures for safe autonomous systems. Its main outcome is a catalogue of challenges in these areas and, most importantly, a pathway to solutions. Some challenges can already be tackled by techniques that are well known in academia but have not yet become regularly used in practice. Other challenges remain unresolved and require further research. This roadmap is intended to support future research and industrial collaboration.
Verifiable Foundation Models for Robot Safety
Deploying foundation models for robot control raises a central challenge: the expressive power that enables rich, multimodal perception also makes these models opaque and difficult to analyze formally, rendering them intractable for existing verification tools. In this paper, we present FEARL (Foundation-Enabled Assured Robot Learning), a framework that addresses this tension through a modular architectural decomposition. FEARL separates the policy into a large Controller (C) responsible for high-dimensional perception and task reasoning, and a small Safety module (S) that receives low-dimensional observations from dedicated safety sensors together with a bounded context embedding from C and produces the final action. Since many robot safety requirements, such as collision avoidance and workspace boundary constraints, can be expressed over these safety sensor observations, formal verification can be applied to S rather than to the full foundation-model backbone. This makes formal analysis tractable with existing tools while preserving the Controller's expressive power for task reasoning. To show that the decomposed policy remains capable of solving diverse tasks, we evaluate FEARL on three simulated robotic domains using multiple Controller backbones and training procedures, including pretrained off-the-shelf vision-language-action models. We further transfer the learned policy from one of our simulated tasks to a physical robot, suggesting that the low-dimensional safety interface supports practical sim-to-real transfer.
Schur-MI: Fast Mutual Information for Robotic Information Gathering IROS 2026
Mutual information (MI) is a principled and widely used objective for robotic information gathering (RIG), providing strong theoretical guarantees for sensor placement (SP) and informative path planning (IPP). However, its high computational cost - dominated by repeated log-determinant evaluations - has limited its use in real-time planning. This paper presents Schur-MI, a Gaussian process (GP) MI formulation that (i) leverages the iterative structure of RIG to precompute and reuse expensive intermediate quantities across planning steps, and (ii) uses a Schur-complement factorization to avoid large determinant computations. Together, these methods reduce the per-evaluation cost of MI from $\mathcal{O}(|\mathcal{V}|^3)$ to $\mathcal{O}(|\mathcal{A}|^3)$, where $\mathcal{V}$ and $\mathcal{A}$ denote the candidate and selected sensing locations, respectively. Experiments on real-world bathymetry datasets show that Schur-MI achieves up to a $12.7\times$ speedup over the standard MI formulation. Field trials with an autonomous surface vehicle (ASV) performing adaptive IPP further demonstrate the method's practicality. By making MI computation tractable for online planning, Schur-MI helps bridge the gap between information-theoretic objectives and real-time robotic exploration. Our code is available at: www.sgp-tools.com
comment: IROS 2026
MILE: A Mechanically Isomorphic Exoskeleton Data Collection System with Fingertip Visuotactile Sensing for Dexterous Manipulation
Imitation learning provides a promising approach to dexterous hand manipulation, but its effectiveness is limited by the lack of large-scale, high-fidelity data. Existing data-collection pipelines suffer from inaccurate motion retargeting, low data-collection efficiency, and missing high-resolution fingertip tactile sensing. We address this gap with MILE, a mechanically isomorphic teleoperation and data-collection system co-designed from human hand to exoskeleton to robotic hand. The exoskeleton is anthropometrically derived from the human hand, and the robotic hand preserves one-to-one joint-position isomorphism, eliminating nonlinear retargeting and enabling precise, natural control. The exoskeleton achieves a multi-joint mean absolute angular error below one degree, while the robotic hand integrates compact fingertip visuotactile modules that provide high-resolution tactile observations. Built on this retargeting-free interface, we teleoperate complex, contact-rich in-hand manipulation and efficiently collect a multimodal dataset comprising high-resolution fingertip visuotactile signals, RGB-D images, and joint positions. The teleoperation pipeline achieves a mean success rate improvement of 64%. Incorporating fingertip tactile observations further increases the success rate by an average of 25% over the vision-only baseline, validating the fidelity and utility of the dataset. Further details are available at: https://sites.google.com/view/mile-system.
comment: 18 pages including supplementary material. Main manuscript and supplementary material included in this version
UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning
Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
Enhancing RL Generalizability in Robotics through SHAP Analysis of Algorithms and Hyperparameters ICPR 2026
Despite significant advances in Reinforcement Learning (RL), model performance remains highly sensitive to algorithm and hyperparameter configurations, while generalization gaps across environments complicate real-world deployment. Although prior work has studied RL generalization, the relative contribution of specific configurations to the generalization gap has not been quantitatively decomposed and systematically leveraged for configuration selection. To address this limitation, we propose an explainable framework that evaluates RL performance across robotic environments using SHapley Additive exPlanations (SHAP) to quantify configuration impacts. We establish a theoretical foundation connecting Shapley values to generalizability, empirically analyze configuration impact patterns, and introduce SHAP-guided configuration selection to enhance generalization. Our results reveal distinct patterns across algorithms and hyperparameters, with consistent configuration impacts across diverse tasks and environments. By applying these insights to configuration selection, we achieve improved RL generalizability and provide actionable guidance for practitioners.
comment: 16 pages, 7 figures, accepted by ICPR 2026
Stealthy World Model Manipulation via Data Poisoning
Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.
comment: 41 pages, 8 figures, 11 tables
Seam-to-Graph Reconstruction for Garment Configuration Alignment
Seams encode rich structural information about garments but are frequently partially observable in robotic manipulation scenarios. To robustly leverage seam information, we propose a Seam-to-Graph network based on graph neural networks and attention mechanisms. This network maps unstructured seam observations to a topology-encoded structural skeleton graph for real-time garment state estimation. Using this skeleton-graph-based state estimation, we design a deformation-aware, hierarchical visual servoing controller for garment configuration alignment. We implement this controller on a bimanual robot system to load a garment onto a screen printing platen and to align it to the desired configuration precisely. Real-robot experiments demonstrate that the robot using the proposed method not only achieves human-level alignment accuracy with reduced variance in alignment error but is also robust to different garments. These results demonstrate that the use of seam information is effective for garment manipulation.
comment: 11 pages, 9 figures
Stable Transformer-Actor-Critic Model Predictive Control: A Contraction Analysis Approach
Actor-Critic Model Predictive Control (MPC) effectively addresses complex, non-convex control problems, but guaranteeing the closed-loop stability of sequence-based learning models within these pipelines remains challenging. This paper introduces a novel Transformer-Actor-Critic MPC architecture with formal robustness guarantees. First, we prove that Transformer networks can satisfy global incremental Input-to-State Stability ($δ$ISS). We then leverage Riemannian contraction theory to analyze the interconnected dynamics between the physical plant and the predictive neural network. Finally, we integrate these theoretical bounds as a training regularizer to yield a certifiably robust policy. The framework is validated on a nonlinear 3D drone model executing target-reaching and obstacle-avoidance maneuvers.
Bracing for Impact: Robust Humanoid Push Recovery and Locomotion with Reduced Order Models
Push recovery during locomotion will facilitate the deployment of humanoid robots in human-centered environments. In this paper, we present a unified framework for walking control and push recovery for humanoid robots, leveraging the arms for push recovery while dynamically walking. The key innovation is to use the environment, such as walls, to facilitate push recovery by combining Single Rigid Body model predictive control (SRB-MPC) with Hybrid Linear Inverted Pendulum (HLIP) dynamics to enable robust locomotion, push detection, and recovery by utilizing the robot's arms to brace against such walls and dynamically adjusting the desired contact forces and stepping patterns. Extensive simulation results on a humanoid robot demonstrate improved perturbation rejection and tracking performance compared to HLIP alone, with the robot able to recover from pushes up to 100N for 0.2s while walking at commanded speeds up to 0.5m/s. Robustness is further validated in scenarios with angled walls and multi-directional pushes.
comment: Accepted to the 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids 2025). Copyright transferred to IEEE
SHIELD: Safety on Humanoids via CBFs In Expectation on Learned Dynamics IROS 2025
Robot learning has produced remarkably effective ``black-box'' controllers for complex tasks such as dynamic locomotion on humanoids. Yet ensuring dynamic safety, i.e., constraint satisfaction, remains challenging for such policies. Reinforcement learning (RL) embeds constraints heuristically through reward engineering, and adding or modifying constraints requires retraining. Model-based approaches, like control barrier functions (CBFs), enable runtime constraint specification with formal guarantees but require accurate dynamics models. This paper presents SHIELD, a layered safety framework that bridges this gap by: (1) training a generative, stochastic dynamics residual model using real-world data from hardware rollouts of the nominal controller, capturing system behavior and uncertainties; and (2) adding a safety layer on top of the nominal (learned locomotion) controller that leverages this model via a stochastic discrete-time CBF formulation enforcing safety constraints in probability. The result is a minimally-invasive safety layer that can be added to the existing autonomy stack to give probabilistic guarantees of safety that balance risk and performance. In hardware experiments on an Unitree G1 humanoid, SHIELD enables safe navigation (obstacle avoidance) through varied indoor and outdoor environments using a nominal (unknown) RL controller and onboard perception.
comment: Accepted to the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). Copyright transferred to IEEE. Video at https://youtu.be/-Qv1wR4jfj4
eMEM: A Hybrid Spatio-Temporal Memory System For Embodied Agents
We present eMEM (Embodied Memory), a hybrid graph-based memory system for embodied agents operating in physical environments. Current agent memory architectures, such as Generative Agents, MemGPT, and A-MEM, treat memory as text streams or knowledge graphs, but embodied agents require memory that is simultaneously searchable by meaning, space, and time. eMEM fills this gap with a multi-index architecture (SQLITE for structured storage, hnswlib for approximate nearest neighbour semantic search, and an R-tree for spatial queries) unified behind a single graph model. A tiered consolidation pipeline transforms raw perceptual observations into compressed summaries, mirroring hippocampal-neocortical consolidation in biological systems. Ten agent-facing recall tools expose memory retrieval primitives, including concept-to-location resolution and cross layer recall, as first-class operations for LLM tool calling. The system is fully embedded and runs in-process alongside the agent. In addition we introduce eMEM-Bench v1, a benchmark we construct over ProcTHOR-10K scenes for embodied memory evaluation. The benchmark is organised explicitly around eight cognitive-psychology paradigms (DRM lures, pattern separation, pattern completion, source monitoring, context-dependent retrieval, long-horizon interference, serial position, and a foil augmented retention curve), each chosen so that the result is interpretable against the broader memory-systems literature in humans and prior agent-memory systems; a level of diagnostic that surface-task benchmarks like LoCoMo or OpenEQA cannot provide. eMEM scores 80.8 weighted mean over 988 probes, with a flat retention curve at ceiling from 1 h to 1 yr of simulated delay on room-unique items. We show that a pure RAG baseline (the flat_rag ablation) loses 30 pt on context dependent retrieval and 29 pt on DRM lure rejection, isolating the contribution of multi-layer storage and consolidation respectively. We release both the system and the benchmark code.
CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions ICRA 2026
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, (2) and safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
comment: Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026). Copyright transferred to IEEE. Sample code for the navigation example with CBF-RL reward core construction can be found at https://github.com/lzyang2000/cbf-rl-navigation-demo
Safe-SAGE: Social-Semantic Adaptive Guidance for Safe Engagement through Laplace-Modulated Poisson Safety Functions IROS 2026
Traditional safety-critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe-SAGE (Social-Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high-level semantic understanding and low-level safety-critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach perceives the environment by fusing multi-sensor point clouds with vision-based instance segmentation and persistent object tracking to maintain up-to-date semantics beyond the camera's field of view. A multi-layer safety filter is then used to modulate system inputs to achieve safe navigation using this semantic understanding of the environment. This safety filter consists of both a model predictive control layer and a control barrier function layer. Both layers utilize the PSF and flux modulation of the guidance field to introduce varying levels of conservatism and multi-agent passing norms for different obstacles in the environment. Our framework enables legged robots to safely navigate semantically rich, dynamic environments with context-dependent safety margins.
comment: Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026). Copyright transferred to IEEE
N2M: Bridging Navigation and Manipulation by Learning Pose Preference from Rollout
Determining where to execute the manipulation policy is a fundamental challenge in mobile manipulation. Most approaches have formulated this as a geometric search problem, prioritizing physical reachability. However, given the high sensitivity of modern learning-based manipulation policies, geometric criteria alone are insufficient. Optimal performance requires base positioning that is aware of the policy's preference. While recent works have attempted to address this, they remain limited in practicality due to reliance on pre-built scene reconstruction and slow inference. In this work, we introduce N2M that systematically reformulates the approach to base positioning problem, naturally overcoming limitations of previous methods. Our key insight is that policy preferences are inherent to the local scene structure and can be effectively learned from the policy rollouts. Technically, we propose a novel viewpoint augmentation strategy that enables the model to learn robust, viewpoint-invariant pose preferences with remarkable data efficiency. Extensive experiments demonstrate that N2M achieves state-of-the-art performance, outperforming both non-policy-aware baselines and recent policy-aware alternatives. Furthermore, we provide a comprehensive analysis highlighting N2M's broad applicability, generalization capabilities, and data efficiency. Project website: https://clvrai.github.io/N2M/
TopoRetarget: Interaction-Preserving Retargeting for Dexterous Manipulation
Human hand-object demonstrations provide dense reference motions for training dexterous manipulation reinforcement learning (RL) policies through reference tracking. However, to use such demonstrations for RL policy learning, retargeting must preserve hand pose and task-relevant hand-object contact structure. Otherwise, contact and feasibility artifacts can degrade downstream RL policy performance. We introduce TopoRetarget, an interaction-preserving retargeting framework that uses a single set of parameters across diverse retargeting conditions while maintaining task-relevant hand-object interaction and adapting human demonstrations to dexterous robot hands. The method constructs a sparse interaction graph over hand and object keypoints and optimizes distance-weighted Laplacian deformation with directional consistency, kinematic constraints, and penetration handling. Evaluations show that the generated references improve both interaction fidelity and policy learning: TopoRetarget achieves the best contact precision and alignment over all baselines on the ContactPose Dataset, improves Pen-Spin training success by 40.6 percentage points over the existing baseline methods, and enables zero-shot transfer to Wuji Hand hardware on cube reorientation and pen spinning.
comment: Project page: https://toporetarget2026.github.io/TopoRetarget/
An Asynchronous Two-Speed Kalman Filter for Real-Time UUV Cooperative Navigation Under Acoustic Delays
In Global Navigation Satellite System (GNSS)-denied underwater environments, individual unmanned underwater vehicles (UUVs) suffer from unbounded dead-reckoning drift, making collaborative navigation (CN) crucial for accurate state estimation. However, the severe communication delay inherent in underwater acoustic channels poses serious challenges to real-time state estimation. Traditional filters, such as Extended Kalman Filters (EKFs) or Unscented Kalman Filters (UKFs), usually block the main control loop while waiting for delayed data, or effectively discard Out-of-Sequence Measurements (OOSMs), resulting in serious drift. To address this, we propose an Asynchronous Two-Speed Kalman Filter (TSKF) enhanced by a novel projection mechanism, which we term Variational History Distillation (VHD). The proposed architecture decouples the estimation process into two parallel threads: a fast-rate thread that utilizes Gaussian Process (GP) compensated dead reckoning to guarantee high-frequency real-time control, and a slow-rate thread dedicated to processing asynchronously delayed collaborative information. By introducing a Finite-Length Circular State Buffer (FLCSB), the algorithm applies delayed measurements to their corresponding historical states, and utilizes a VHD-based projection to fast-forward the correction to the current time without computationally heavy recalculations. Simulation results demonstrate that the proposed TSKF maintains a trajectory error comparable to computationally intensive batch-optimization methods under severe delays (up to 30\,s). Executing in sub-millisecond time, it significantly outperforms standard EKF/UKF. The results demonstrate an effective control, communication, and computing (3C) co-design that significantly enhances the resilience of autonomous marine automation systems.
comment: 6 pages, 6 figures. Accepted for publication in the 2026 IEEE International Conference on Industrial Informatics (INDIN). \c{opyright} 2026 IEEE. Personal use of this material is permitted. See PDF for the full IEEE copyright notice
A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems
Dynamic environmental changes, confined workspaces, and stringent real-time constraints make pathfinding in Robotic Mobile Fulfillment Systems (RMFS) a challenging problem for conventional search- and rule-based methods, which typically suffer from high computational complexity and long decision latency. While reinforcement learning (RL) has emerged as a powerful alternative, deploying learned policies with extreme energy efficiency on resource-constrained hardware remains an open challenge. We present SDQN-RMFS, an end-to-end framework that achieves high-fidelity deployment of an RL-trained policy from a full-precision artificial neural network (ANN) through to a neuromorphic chip. By computing only when triggered by sparse events, this framework unlocks ultra-low-power RMFS pathfinding. Our full-stack pipeline operates as follows: an ANN policy is first efficiently trained via a collision-allowing strategy to densify informative trajectories, and then converted into a spiking neural network (SNN) via a hard-label knowledge distillation approach. This effectively addresses the output distribution mismatch, preserving policy capability across the ANN-to-SNN pipeline while substantially reducing inference latency. Hardware experiments demonstrate up to 11,281$\times$ energy savings and a nearly two-fold reduction in latency compared to a high-performance GPU baseline, while maintaining decision quality on par with the original trained policy. These results establish physical neuromorphic inference as a practical and energy-sustainable pathway for large-scale RMFS operations.
GO: The Great Outdoors Multimodal Dataset
The Great Outdoors (GO) dataset is a multi-modal annotated data resource aimed at advancing ground robotics research in unstructured environments. Existing off-road datasets often lack sensor diversity and exclude vital modalities like thermal and radar that are critical for operation in degraded conditions (e.g., low visibility or adverse weather). To address these gaps, we introduce a large-scale multimodal off-road dataset with six complementary sensor modalities, along with semantic annotations and GPS traces, to support tasks such as semantic segmentation, object detection, and SLAM. The diverse environmental conditions represented in the dataset present significant real-world challenges, which provide opportunities to develop more robust solutions to support the continued advancement of field robotics, autonomous exploration, and perception systems in natural environments. The dataset can be downloaded at: https://www.unmannedlab.org/the-great-outdoors-dataset/
comment: 7 pages, 7 figures, accepted at IV 2026
Geometric Action Model for Robot Policy Learning
Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
comment: Project page: https://cvlab-kaist.github.io/Geometric-Action-Model/
Self-CriTeach: LLM Self-Teaching and Self-Critiquing for Improving Robotic Planning via Automated Domain Generation ICML
Large Language Models (LLMs) have recently shown strong promise for robotic task planning, particularly through automatic planning domain generation. However, prior approaches largely treat generated planning domains as planning utilities, which are brittle under imperfect logical states and perception noise, overlooking their potential as scalable sources of reasoning supervision and structured reward signals. At the same time, reasoning LLMs depend on chain-of-thought (CoT) supervision that is expensive to collect for robotic tasks, and reinforcement learning (RL) faces challenges in reward engineering. We propose Self-CriTeach, an LLM self-teaching and self-critiquing framework in which an LLM autonomously generates symbolic planning domains that serve a dual role: (1) enabling large-scale generation of robotic planning problem-plan pairs, and (2) providing structured reward functions. First, the self-written domains enable large-scale generation of symbolic task plans, which are automatically transformed into extended CoT trajectories for supervised fine-tuning. Second, the self-written domains are reused as structured reward functions, providing dense feedback for reinforcement learning without manual reward engineering. This unified training pipeline yields a planning-enhanced LLM with higher planning success rates, stronger cross-task generalization, reduced inference cost, and resistance to imperfect logical states. GitHub Page: https://markli1hoshipu.github.io/Plan_LLM/
comment: International Conference on Machine Learning (ICML) 2026
Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation
Language-conditioned robot manipulation is an emerging field aimed at enabling seamless communication and cooperation between humans and robotic agents by teaching robots to comprehend and execute instructions conveyed in natural language. This interdisciplinary area integrates scene understanding, language processing, and policy learning to bridge the gap between human instructions and robot actions. In this comprehensive survey, we systematically explore recent advancements in language-conditioned robot manipulation. We categorize existing methods based on the primary ways language is integrated into the robot system, namely language for state evaluation, language as a policy condition, language for cognitive planning and reasoning, and language in unified vision-language-action models. Specifically, we further analyze state-of-the-art techniques from five axes of action granularity, data and supervision regimes, system cost and latency, environments and evaluations, and task specification. Additionally, we highlight the key debates in the field. Finally, we discuss open challenges and future research directions, focusing on potentially enhancing generalization capabilities and addressing safety issues in language-conditioned robot manipulators.
From Singleton Obstacles to Clutter: Translation Invariant Compositional Avoid Sets
This paper studies obstacle avoidance under translation invariant dynamics using an avoid-side travel cost Hamilton Jacobi formulation. For running costs that are zero outside an obstacle and strictly negative inside it, we prove that the value function is non-positive everywhere, equals zero exactly outside the avoid set, and is strictly negative exactly on it. Under translation invariance, this yields a reuse principle: the value of any translated obstacle is obtained by translating a single template value function. We show that the pointwise minimum of translated template values exactly characterizes the union of the translated single-obstacle avoid sets and provides a conservative inner certificate of unavoidable collision in clutter. To reduce conservatism, we introduce a blockwise composition framework in which subsets of obstacles are merged and solved jointly. This yields a hierarchy of conservative certificates from singleton reuse to the exact clutter value, together with monotonicity under block merging and an exactness criterion based on the existence of a common clutter avoiding control. The framework is illustrated on a Dubins car example in a repeated clutter field.
Multiagent Systems
MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?
Multi-agent systems (MAS) offer a scalable path forward for agentic AI, comprising multiple LLM-based agents, each assigned a system prompt and a position within a workflow that governs inter-agent coordination and output aggregation. System prompts thus form a critical and accessible optimization surface: they specify agents' roles and behaviors, enabling system-level improvements without model finetuning. Although prompt optimization has shown substantial potential for single LLMs, extending it to MAS poses distinct challenges, notably an exponentially growing search space. It remains unclear whether, when, and by how much prompt optimization improves MAS performance, and how sensitive such gains are to system configuration. In this work, we systematically study system-prompt optimization across a broad range of MAS setups varying in task, workflow, communication protocol, and team size, benchmarking two prompt optimizers that naturally extend state-of-the-art single-agent methods. The results reveal its potential to unlock significant gains while exposing open challenges, characterizing when and how much prompt optimization helps across diverse MAS settings.
comment: Project page: https://juyangbai.github.io/MAS-PromptBench/ ; Code: https://github.com/juyangbai/MAS-PromptBench
Decentralized Autonomous Traffic Management through Corridor Networks
As autonomous aircraft are introduced at scale and traffic density increases, centralized management becomes insufficient to coordinate the large numbers of crewed and uncrewed aircraft. Dedicated Advanced Air Mobility (AAM) corridors have therefore been proposed for organizing high-density autonomous traffic flows. The desire to scalably provide autonomous aircraft flexibility in trajectory planning motivates the development of decentralized approaches to traffic management in AAM corridors. In this work, we extend a multi-agent reinforcement learning (MARL) approach to address the challenge of decentralized traffic flow management in air corridor networks. We test policies trained in a single-corridor setting on increasingly complex multi-corridor networks with combinations of merges and splits in a zero-shot manner. Experimental results demonstrate that learned behaviors transfer well to scenarios with varying traffic density, network geometry, and heterogeneous vehicle performance, without needing centralized coordination or model retraining. We evaluate system-level performance in terms of conformance to corridor boundaries, completion rates, average speeds, distance traveled, and maintenance of inter-aircraft separation. We find that although our policies require only locally coordinated entry, traversal, and exit behaviors, they collectively produce desirable traffic flows through the corridor network.
comment: Presented at the Second US-Europe Air Transportation Research and Development Symposium (ATRDS2026)
Decomposing Financial Market Dynamics via Mechanism Analysis in an Evolutionary Multi-Agent Simulation
Evolutionary agent-based markets (ABMs) couple several mechanisms -- who reproduces, how price forms, how biased the agents are, how consensus propagates -- yet these are usually fixed by convention, so it is unclear which mechanism controls which emergent property. In a coevolving, endogenous-price simulator with 120 heterogeneous behavioral agents, we make four mechanisms pluggable and run matched 3x20-seed interventions. We find the levers are largely separable. (1) Selection -> diversity: a Quality-Diversity (QD/MAP-Elites) operator robustly raises strategy-mix entropy over truncation top-k (paired Delta entropy +0.27 to +1.12 bits; sign-test p<0.001; CIs exclude 0) and sustains more strategy cycling (strongest in crisis: Delta=+0.070, p=0.0004). (2) Selection does not improve realism: even a per-agent realism reward that provably steers selection does not raise 5-fact realism (Delta_5=-0.11,-0.08,+0.03; not significant). (3) Microstructure -> realism: enabling reflexive price feedback does raise realism (Delta_5=+0.13,+0.20,+0.20; crisis/bull p<0.05, all CIs positive). (4) Behavior -> fragility: amplifying behavioral bias raises a genomic fragility proxy (Delta=+10.5,+11.1,+14.4; bull p<0.001, all CIs positive) while leaving realism flat. The remaining mechanism -- consensus network topology -- shows no robust effect (honest null). The contribution is a decomposition: in these single-mechanism sweeps the mechanisms behave as approximately distinct control knobs over diversity, realism, and fragility.
comment: 6 pages, 2 figures
RaMem: Contextual Reinstatement for Long-term Agentic Memory
Long-term memory has become increasingly important for LLM agents that operate across extended interactions and evolving task contexts. Recent memory systems have made past experiences more persistent, compact, and retrievable, but retrieval alone does not ensure that a memory provides valid evidence for the current query. When experiences are compressed into reusable fragments, memories from different situations may appear equally relevant if they involve recurring entities or user states. We refer to this failure as context collapse: memories lose the surrounding context needed to judge whether they provide valid evidence for the current query. To address this problem, we propose Contextual Reinstatement for Agentic Memory (RaMem), a framework that turns retrieved memory fragments into contextually verifiable evidence. RaMem operates through four coordinated stages: (i) evidence anchoring grounds each memory in its original episodic conditions, especially event time, mention time, session span, and participants; (ii) recall condition induction derives the evidence conditions implied by the query; (iii) validity-aware retrieval uses these conditions to prioritize context-compatible memories while retaining content-relevant candidates as fallback evidence; and (iv) context-preserved synthesis keeps the selected memories' structured context available to the generator. Experiments on long-term memory benchmarks show that RaMem consistently improves performance over strong memory baselines, with average F1 gains of more than 10% across several backbones.
HERCULES: An Open-Source Simulation Framework for Heterogeneous Multi-Robot SLAM, Collaborative Perception, and Exploration
We present HERCULES, an open-source simulator and data-collection pipeline for heterogeneous multi-robot autonomy. Built upon the Unreal Engine 5 (UE5)-based simulators AirSim and Cosys-AirSim, HERCULES resolves key architectural limitations of prior frameworks to enable concurrent unmanned aerial and ground vehicle (UAV-UGV) operation in large-scale, photorealistic, dynamic environments. It introduces a new waypoint-tracking UGV controller that mirrors existing UAV control interfaces, and provides a shared navigation stack for mapping, traversability analysis, planning, and control across heterogeneous platforms. Expanding inherited sensor suites, it adds physics-based long-wave infrared (LWIR) cameras and configurable night-vision modes for degraded visual environments. HERCULES provides lightweight APIs, ROS 2 wrappers, and rigorous time synchronization across sensors and platforms, and brings state-of-the-art game-engine capabilities into robotics simulation, integrating intelligent agents such as pedestrians, traffic, and wildlife with high-fidelity dynamic phenomena, including fire, flooding, and crop disease spread. HERCULES runs in two modes: passively, replaying offline-designed trajectories to generate reproducible multi-modal datasets, and actively, running an online planner in closed loop from live observations. Our experiments in heterogeneous multi-robot SLAM, collaborative perception, and exploration, using both HERCULES-generated data and active closed-loop execution, demonstrate its utility for advancing heterogeneous multi-robot autonomy. We publicly release our source code, experiment code, documentation, and datasets, including a heterogeneous multi-robot SLAM benchmark collected with two UAVs and two UGVs across kilometer-scale desert, forest, and city environments, at https://lunarlab-gatech.github.io/HERCULES-website.
comment: 19 pages, 14 figures, and 12 tables
Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements
Closed-loop Auto Research extends automated machine learning from fixed-dataset fitting to changing the research workflow, with language-model agents editing representations and model code and acquiring external evidence. Molecular property prediction spans many small endpoints. We ask whether this action space yields improvements generalizing beyond the validation signal selecting them. We isolate three Auto Research axes, features, models, and external evidence, under a file-level ablation lock attributing each gain to one axis over a strong baseline. Across 36 endpoints in three benchmark suites we score each selected configuration once on a held-out test whose labels the search never read. A routed pipeline taking each endpoint's best validation axis reaches positive held-out gains of 0.013, 0.011, and 0.042, the transferable axis differing by suite, data on TDC, model on Polaris, feature and model on MoleculeNet. The largest model-search gain falls from 0.041 on validation to 0.003 on test, while curated data reaches 0.022 but negative 0.019 on test, two non-transfer signatures. Curated external data raises held-out CYP2C9-substrate performance by 0.17 and half-life by 0.08, admitted through a contamination filter rejecting same-source files overlapping 64 to 89 percent of test structures, necessary but not sufficient for transfer. A matched-trial automated machine learning control did not reproduce the agent's code-level model intervention, reaching 0.006 against 0.042, and the pipeline stays competitive with an 84M-parameter pretrained 3D model on the shared training split. The experiments stay within molecular property prediction, but separating discovery from held-out certification is a domain-agnostic lesson for any closed-loop system optimising a proxy for a held-out quantity.
EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games ICML 2026
Recent work has established that regularized policy gradient methods such as PPO, when used in self-play, can match or exceed specialized game-theoretic algorithms for solving two-player zero-sum imperfect-information games. The uniform distribution has emerged as a strong policy regularization target for this purpose, but it regularizes equally toward all actions regardless of their viability. We introduce EMAgnet, which instead regularizes toward an exponential moving average (EMA) of the last-iterate policy's parameters, providing an adaptive regularization target that evolves with the agent's improving strategy. We evaluate EMAgnet on both standard two-player zero-sum benchmarks and modified benchmarks with exploration challenges and large numbers of strictly dominated strategies. Relative to PPO self-play with uniform-magnet regularization under both linear and power-law annealing schedules, EMAgnet achieves lower exploitability in the majority of tested environments, with consistent performance gains across games containing strictly dominated strategies.
comment: Accepted at NExT-Game 2026: New Frontiers in Game-Theoretic Learning (ICML 2026 Workshop). 13 pages, 2 figures,
Critique of Agent Model
What is an agent? What constitutes agency? With the rise of Large Language Model (LLM) systems marketed as ``coding agents'', ``AI co-scientists'', and other ``agentic" tools that promise to drive up productivity, and at the same time, ``existential" concerns such as AI escaping human control with destructive power under a speculative ``machine agency" against humans, it has become essential to clarify where automation ends and agency begins, both for building capable systems and for understanding whether and what to fear. Drawing on Descartes' grounding of agency in independent thought, and on portrayals of autonomous beings in science fiction, we survey the current landscape of AI agents, and analyze agent architectures along five dimensions: goal, identity, decision-making, self-regulation, and learning. Specifically, we argue that genuine agency requires these structures to be \emph{internalized within the system itself} rather than assembled through external scaffolding. This distinction between \emph{agentic} systems, whose competence resides in engineered workflows, and \emph{agentive} systems, whose capabilities (including social interaction) arise endogenously, defines the boundary between systems designed for prescribed tasks, and those capable of operating in the open world with true autonomy. Building on this analysis, we propose the Goal-Identity-Configurator (GIC) architecture for a general-purpose agent model, combining hierarchical goal decomposition, identity evolution, simulative reasoning grounded in a separately trained world model, learned self-regulation, and self-directed learning from both real and simulated experience. Furthermore, we share insight on the auditability, controllability, and safety of agentive systems that possess greater autonomy and ``agency", but remain under human oversight.
Maestro Order: A Model-Agnostic Orchestration Harness
A single forward pass of a capable model is a fast, fluent, and unreliable problem-solver: it is right often enough to be useful and wrong often enough to be dangerous; in language models, such confident errors are known as hallucinations. We present Maestro Order, a model-agnostic orchestration harness that turns unreliable solvers into reliable problem-solving systems by composing them according to four structural primitives (decompose, ensemble, verify, and recurse) and a budget-aware controller that decides where to spend compute. The harness treats any model as a black-box base solver behind a uniform interface, layers a verifier ensemble whose discrimination is measured online, and allocates verification and voting to the stages with the highest marginal reliability per unit cost. We give the architecture, the message and state schema, the controller algorithm, and the engineering that makes it deterministic, observable, and fault-tolerant. We then specify an evaluation methodology (reliability at fixed cost, coverage, calibration, and ablations) and report results from a faithful Monte Carlo simulation of the harness over a parameterized solver/verifier model. The simulation reproduces the predicted laws quantitatively: verification amplifies reliability geometrically (e.g. $0.55\to0.98$ with two gates, $\to0.999$ with four), voting helps only above chance and is limited by shared errors, and a budget-aware controller reaches a target reliability at a small fraction of the cost of voting alone by selecting the cheapest mechanism for each regime. We close with failure modes (verifier gaming, correlated errors, and decomposition error compounding) and concrete guidance: build robust checkers, diversify solvers, and let the controller put compute where the information is.
comment: 10 pages, 4 figures
Welfarist Control Design -- How to fulfill the societal mandate in multi-agent control?
At the core of most socio-technical systems lies a scarce resource that is allocated among agents: highway lanes, public transit, road space, water rights, energy access, grid capacity, user attention, pollution rights, etc. With further automation of the underlying allocation processes, control engineers are increasingly tasked to make decisive assumptions regarding what society wants. In practice to date, design choices are largely driven by industry norms and conventions rather than a result of conscientiously responsible and ethical design. In this paper, we look at tools available to control engineers to design systems in a more principled manner in order to match the societal mandate. We consider three control design paradigms: online feedback optimization, control of Markov decision processes, and model predictive control. Beginning with aggregating individual agents' preferences into control design objectives, subsequently ensuring and certifying the fulfillment of those specifications, we argue that the feedback nature of control systems enables appropriate allocation of the shared resources in ways hitherto unparalleled.
Decentralized Coordination of Autonomous Traffic Through Advanced Air Mobility Corridors
The use of dedicated corridors for Advanced Air Mobility (AAM) traffic is one of the most commonly proposed pathways to integrating them into existing airspace operations. Most prior research has focused on the design of networks of AAM corridors and conflict resolution for aircraft within corridors. It is also generally believed that while attractive from an implementation perspective, corridor-based operations may be inefficient, especially in the absence of centralized traffic management. In this paper, we show that contrary to this belief, it is possible for autonomous aircraft to learn to self-organize into corridor flows in decentralized settings. We illustrate our approach using scenarios in which fixed-wing aircraft need to safely and efficiently traverse (1) a single corridor with metering after the exit, (2) a sequence of two consecutive corridors, and (3) a corridor that splits into two. We find that in decentralized settings with only local information, the aircraft are able to conform to the corridor boundaries more than 94% of the time and reach their goal in a relatively efficient manner. Furthermore, tactical interventions to handle violations of the separation minimum are needed only infrequently in low- and medium-density settings. However, such tactical interventions become more frequently necessary only when traffic density is high.
comment: Presented at the AIAA SciTech 2026 Forum
From Task-Guided Conversational Graphs to Goal-Oriented Dialogue Runtimes
Graph and multi-agent orchestration frameworks make production large language model (LLM) workflows practical, but they do not by themselves solve conversational continuity when users maintain several interdependent objectives. This conceptual systems paper focuses on the high-complexity end of that design space, where goals can be suspended, resumed, revised, and invalidated by actions in other goals. We introduce the Goal-Oriented Dialogue Runtime (GODR), a framework-neutral design pattern that treats goals, task frames, lifecycle state, invalidation rules, and resumption contracts as first-class runtime objects while delegating bounded execution to graph runtimes, agents, tools, or application programming interfaces (APIs). GODR is not proposed as a replacement for workflow graphs in simple guided processes; it is intended for complex, multi-domain, interruptible conversations where objective continuity cannot be recovered reliably from agent identity, chat history, or execution-graph position alone. The paper formalizes the problem, proposes runtime objects and architecture-selection criteria, and frames evaluation as an agenda for future empirical validation rather than as a measured performance claim.
comment: 21 pages, 7 figure, 10 tables
Emergent Relational Order in LLM Agent Societies: From Collective Affect to Authority Stratification ACL 2026
Fei Xiaotong's Differential Order Pattern characterizes rural society as egocentric and relationally graded, with cooperation attenuating over social distance. Although often treated as culturally specific, its mechanistic basis remains under-operationalized, and prior LLM-based simulations have mainly addressed short-term coordination rather than long-horizon social structure. We propose CAREB-MAS, a multi-agent framework grounded in Affect Control Theory, Social Identity Theory, and Durkheimian collective affect. Agents reason through an emotion-ethics-belief chain and maintain dynamically evolving egocentric identities, while the macro environment specifies only individual production, preference-based allocation, and minimal interaction protocols. Across long-horizon simulations, agents spontaneously reproduce five core Differential Order phenomena: stable labor specialization, guanxi-based economic ethics, relational decay of cooperation, emergent relational authority, and clan-based center-periphery stratification. These patterns shift with production structure from kin-centered integration toward greater functional interdependence. Extensive experiment results support interpreting Differential Order as a structure-sensitive emergent outcome of general social mechanisms, with LLM-based multi-agent simulation providing an interdisciplinary framework for studying social structure and change.
comment: Accepted to Findings of the Association for Computational Linguistics: ACL 2026. 37 pages
Engineering Reliable Autonomous Systems: Challenges and Solutions
Engineering reliable autonomous systems is an important and growing topic in computer science. As autonomous systems become more prevalent, easy-to-use techniques for building them reliably are increasingly important. This workshop report captures and expands on the discussions at the Lorentz Center Workshop "Engineering Reliable Autonomous Systems" (ERAS), held from 10 to 14 June 2024. The workshop was co-organised by the organisers of the Workshop on Formal Methods for Autonomous Systems (FMAS) and the Workshop on Agents and Robots for reliable Engineered Autonomy (AREA). It brought together members of the FMAS and AREA communities, industry practitioners, and representatives from sectors where autonomous systems pose distinctive engineering challenges. The workshop focused on three main research topics: techniques for verification and validation of autonomous systems; engineering real-world autonomous systems; and software architectures for safe autonomous systems. Its main outcome is a catalogue of challenges in these areas and, most importantly, a pathway to solutions. Some challenges can already be tackled by techniques that are well known in academia but have not yet become regularly used in practice. Other challenges remain unresolved and require further research. This roadmap is intended to support future research and industrial collaboration.
Digital Speech Acts Retain Control of Copyright with People, Not Platforms
Legal precedents protect computer code as copyrightable expression. They have enabled centralized digital platforms -- operating from corporate servers that hold all user data -- to construct private governance regimes through the interaction of copyright, contract, and technical architecture: people who create virtually all platform value must surrender effective copyright control through Terms of Service agreements as a condition of participation. In contrast, grassroots platforms consist of cryptographically-identified people operating their networked smartphones independently of any server or global resource; each person holds their own data on their own device, with no third party in possession or intermediation. Here, we define the notion of a digital speech act -- a deliberate volitional act by a person of cryptographically signing personal content with the person's private key, carried out on the person's own device -- through which the person simultaneously establishes attribution, accountability, and authorship over the signed content. We contend that (i) digital speech acts qualify for copyright protection under existing U.S. precedent: Burrow-Giles locates authorship in volitional creative choices despite mechanical or algorithmic processes, Feist supplies the minimal-creativity threshold, and persistent device storage satisfies the Copyright Act's fixation requirement; (ii) the digital social contract underlying grassroots platforms preserves this copyright by design -- signed content cannot be unbundled from its signature, and the full provenance chain accumulates as content is forwarded -- so that legal ownership and physical possession coalesce in the person; and (iii) copyright in digital speech acts is a prerequisite for digital sovereignty and democratic self-governance.
Evolving Many Worlds: Towards Open-Ended Discovery in Petri Dish NCA via Population-Based Training
The generation of sustained, open-ended complexity from local interactions remains a fundamental challenge in artificial life. Differentiable multi-agent systems, such as Petri Dish Neural Cellular Automata (PD-NCA), exhibit rich self-organization driven purely by spatial competition; however, they are highly sensitive to hyperparameters and frequently collapse into uninteresting patterns and dynamics, such as frozen equilibria or structureless noise. In this paper, we introduce PBT-NCA, a meta-evolutionary algorithm that evolves a population of PD-NCAs subject to a composite objective that rewards both historical behavioral novelty and contemporary visual diversity. Driven by this continuous evolutionary pressure, PBT-NCA spontaneously generates a plethora of emergent lifelike phenomena over extended horizons-a hallmark of true open-endedness. Strikingly, the substrate autonomously discovers diverse morphological survival and self-organization strategies. We observe highly regular, coordinated periodic waves; spore-like scattering where homogeneous groups eject cell-like clusters to colonize distant territories; and fluid, shape-shifting macro-structures that migrate across the substrate, maintaining stable outer boundaries that enclose highly active interiors. By actively penalizing monocultures and dead states, PBT-NCA sustains a state of effective complexity that is neither globally ordered nor globally random, operating persistently at the "edge of chaos".
comment: 10 pages, 12 figures
Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs
We present Phoenix, a multi-agent LLM system that resolves GitHub issues from triage through pull-request creation, combining seven layered safety controls with a baseline-aware test evaluation strategy. Phoenix decomposes the work across six specialized agents. Planner, reproducer, coder, tester, failure analyst and Pull Request (PR) agent, all coordinated by a label-based GitHub webhook state machine. Every change is checked against a baseline test run before a pull request is opened. On a 24-instance slice of SWE-bench Lite. run on the production webhook path, Phoenix oracle-resolves 75% of instances with no pass-to-pass regressions on successful runs; this curated slice is not directly comparable to full-split leaderboard results, and we discuss the limits of the comparison. A complementary pilot on 42 real issues across 14 repositories yields 100% correctness preservation (CP; mean 122s on the hard tier). Manual inspection shows that about half of the resulting pull requests are well-targeted fixes. The other half place code at incorrect paths, a planner localization limitation we are addressing with retrieval. We also report the deployment failure modes (WAF filtering, token expiry, permission boundaries, flaky CI) that motivated each safety mechanism.
Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda IJCAI
LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent's decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based monitoring: a structural foundation that prevents control-flow violations, while guardrails remain essential for catching semantic errors. We identify a structured set of neuro-symbolic research challenges on foundational and capability level and show that addressing them jointly enables compliance-by-construction. We call on the neuro-symbolic community to engage with regulated process automation as a high impact research domain.
comment: Accepted as a poster in NILA Workshop @ IJCAI-ECAI 2026
Computing Evolutionarily Stable Strategies in Multiplayer Games
We present an algorithm for computing all evolutionarily stable strategies in nondegenerate normal-form games with three or more players.
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
comment: Benchmark harness and code available at http://github.com/siegelz/core-bench
ATHENA: Agentic Team for Hierarchical Evolutionary Numerical Algorithms
Progress in computational science depends on complex numerical workflows that must faithfully encode physical laws, yet translating conceptual insight into reliable code remains a major bottleneck. Although large language models can generate isolated code fragments, they lack the structured reasoning required to design, verify, and iteratively refine complete scientific pipelines. Here we introduce ATHENA, an agentic framework explicitly designed to emulate scientific research modeled as a knowledge-driven contextual bandit process. Its core loop separates conceptual policy from numerical realization through expert-derived conceptual scaffolding, enabling principled diagnosis, reformulation, and repair of computational strategies. Across scientific computing and scientific machine learning tasks, ATHENA autonomously derives and correctly applies exact analytical solutions, constructs stable numerical solvers, diagnoses ill-posed formulations, and orchestrates hybrid symbolic-numeric workflows. Quantitatively, ATHENA matches and frequently surpasses the accuracy of expert-authored reference solutions reported in the literature on canonical benchmarks. By reframing computation as an object of agentic reasoning, our framework enables autonomous orchestration of heterogeneous algorithms across scientific domains.
Systems and Control (EESS)
Autonomous Subsea Cable Search and Tracking with Graph-Optimised Priors and Visual Tracking
Global communications rely on subsea cable infrastructure that remains vulnerable to damage from natural hazards and human activity. Autonomous underwater vehicles (AUVs) offer an efficient means to inspect long sections of exposed cable, but uncertainty in cable route maps, small cable diameters and partial burial makes continuous tracking a challenge. This paper presents a novel cable search and tracking method that leverages uncertain prior cable route maps. Graph-based optimisation continuously update the cable route to remain consistent with visual observations. Route uncertainty is constrained as a function of distance from observations using physics-based catenary models that account for cable parameters (i.e., lay depth, diameter, and density), bounding the search space to physically feasible regions and improving search efficiency. Cable detection is performed using a semi-supervised classifier running in real-time on-board a camera-equipped AUV. These detections both update the graph-based optimisation and enable visual cable tracking. When tracking is lost due to misclassification, burial or imperfect control, the bounded search space enables efficient recovery. The approach was demonstrated in field trials using the University of Southampton's Smarty200 AUV. The system successfully located the cable despite deliberate errors in it initial cable route map, updating this to be consistent with observations and using visual tracking to inspect up to 59% of a 120m test cable, with successful recovered after tracking loss.
Two-Stage Optimization for Dynamic Line Rating and Energy Storage Deployment
The increasing penetration of distributed energy resources (DER) and weather-driven variability has intensified congestion and reliability stress in transmission networks. Strategies that enhance the utilization of existing infrastructure, such as static line ratings (SLR) and energy storage systems (ESS), have therefore become necessary. SLRs rely on conservative ambient assumptions and often understate thermal limits, whereas dynamic line ratings (DLR) adjust capacity according to weather conditions and unlock additional transfer capability. Energy storage systems provide temporal flexibility, but their transmission-level effectiveness depends on proper siting and sizing. This paper proposes a two-stage optimization method for joint placement of DLR installations and utility-scale energy storage. In the first stage, a mixed-integer linear program selects DLR corridors and ESS buses by minimizing operating cost, DER curtailment, and load-shedding penalties subject to DC power flow and investment constraints. In the second stage, the model determines ESS energy capacity and operating schedules under ambient-driven line ratings. Ambient weather data is used to generate DLR profiles, and sequential Monte Carlo simulation is applied to assess system adequacy. The proposed method, when deployed on the modified IEEE RTS 24-bus system, shows that coordinated DLR and ESS planning improves transmission capability, mitigates congestion, and strengthens system adequacy under weather variability.
comment: To appear in Proceedings of IEEE PES GM 2026
Decentralized Autonomous Traffic Management through Corridor Networks
As autonomous aircraft are introduced at scale and traffic density increases, centralized management becomes insufficient to coordinate the large numbers of crewed and uncrewed aircraft. Dedicated Advanced Air Mobility (AAM) corridors have therefore been proposed for organizing high-density autonomous traffic flows. The desire to scalably provide autonomous aircraft flexibility in trajectory planning motivates the development of decentralized approaches to traffic management in AAM corridors. In this work, we extend a multi-agent reinforcement learning (MARL) approach to address the challenge of decentralized traffic flow management in air corridor networks. We test policies trained in a single-corridor setting on increasingly complex multi-corridor networks with combinations of merges and splits in a zero-shot manner. Experimental results demonstrate that learned behaviors transfer well to scenarios with varying traffic density, network geometry, and heterogeneous vehicle performance, without needing centralized coordination or model retraining. We evaluate system-level performance in terms of conformance to corridor boundaries, completion rates, average speeds, distance traveled, and maintenance of inter-aircraft separation. We find that although our policies require only locally coordinated entry, traversal, and exit behaviors, they collectively produce desirable traffic flows through the corridor network.
comment: Presented at the Second US-Europe Air Transportation Research and Development Symposium (ATRDS2026)
Industrial electrification in the era of data centers: A Bayesian Optimization approach for grid-aware large load allocation
Large loads from industrial electrification and data centers are reshaping the planning and operation of the power grid. Identifying optimal large load siting decisions while accounting for transmission congestion is key to reducing expansion cost and operational risks. In this paper, we propose a leader-follower bilevel optimization framework to identify optimal large load allocation strategies. The leader determines the allocation of large loads, while the followers determine grid expansion cost and transmission utilization. This modeling approach explicitly integrates strategic planning with detailed short-term operational decisions. Moreover, we develop a Bayesian Optimization approach to efficiently solve the bilevel optimization problem by treating the followers as a black box. We use the framework to study large-scale load allocation from electrified oil refineries and data centers on a synthetic power grid that resembles key characteristics of the Texas (ERCOT) system. The results show that these large loads compete for electricity, and under high-load scenarios, data center demand is distributed across the entire grid, avoiding regions with high demand from industrial electrification.
A Relaxed Quadratic-Program-based Framework for Trajectory Tracking of Unicycle Robots with Singularity Avoidance
Dynamic feedback linearization (DFL) is a classical technique for trajectory tracking of unicycle-type mobile robots, but the resulting DFL-based controller becomes singular when the linear velocity vanishes, rendering standard DFL-based controllers unsuitable for stop-and-reverse maneuvers. This paper proposes a quadratic-program (QP)-based optimal control framework that avoids this singularity, while establishing local Lipschitz continuity of the resulting feedback law. Our approach reformulates the DFL constraints as an equality-constrained QP with a slack variable, ensuring feasibility for all states and reference signals, including at points where the robot's velocity vanishes. By introducing slack variables and tunable parameters, we demonstrate that the singular configuration can be avoided for a large class of reference trajectories. The effectiveness of the proposed approach for trajectory tracking is demonstrated through ROS 2-Gazebo simulations on a TurtleBot3 Waffle robot. The code is available at https://gradslab.github.io/DFL_QP_Unicycle/
comment: 6 pages, 4 figures, paper accepted at Conference of Control Technology and Applications (CCTA) 2026
A Hybrid Intrusion Detection System for Electric Vehicle Charging Infrastructure
The integration of Electric Vehicle Charging Stations (EVCSs) into the smart grid necessitates sophisticated digital infrastructure for their management and coordination, which expands the attack surface and makes both the power grid and EVCSs vulnerable to cyberattacks. This research addresses critical gaps in existing EVCS Intrusion Detection Systems (IDS) by proposing a hybrid IDS that integrates attack detection on both the cyber and physical layer of the EVCS ecosystem. The proposed hybrid IDS utilizes a dual-layer integration method, which combines network-based IDS (NIDS) and host-based IDS (HIDS). This approach facilitates for comprehensive monitoring of both network traffic through the NIDS and host-level activities via the HIDS, effectively addressing the unique challenges posed by the interconnected nature of EVCS ecosystems. Utilizing the recent CICEVSE2024 dataset, the IDS presented in this work performs multiclass classification across various attack types, including False Data Injection Attacks (FDIAs), reconnaissance, denial of service, backdoor, and cryptojacking attacks. Experimental results demonstrate that our approach achieves excellent detection accuracy, with the NIDS component reaching 99.99\% accuracy for network-based attacks and the HIDS component achieving 83.47\% accuracy on FDIA, cryptojacking, backdoor, all DoS, all Recon except Slowloris Scan attacks. This dual-layer detection significantly outperforms single-source detection approaches previously presented in literature.
comment: This work has been submitted to the IEEE for possible publication
Value iteration with stopping criterion: finite iterations, stability, and near-optimality guarantees
Value iteration (VI) is a cornerstone of dynamic programming that allows computing near-optimal feedback laws for general plant dynamics and cost functions. In practice, however, it must be stopped after finitely many iterations. This raises the question of when to stop the algorithm so that the resulting policies and value functions achieve desirable properties, like given near-optimality bounds and stability. In this context, we study deterministic, discrete-time systems with infinite-horizon (possibly discounted) costs whose inputs are generated by VI. We equip VI with a generalized stopping criterion that encompasses existing choices while allowing new ones. Our aim is to analyze the properties of the policies and value functions at the final iteration. Under mild assumptions, we first show that VI indeed terminates in a finite number of iterations. We then establish that the final policies are stabilizing by properly designing the stopping criterion, and derive explicit near-optimality bounds characterized by this choice. These results offer a design framework for the stopping criteria that balances computational effort with stability and performance guarantees.
comment: Preprint, submitted to IEEE TAC
Path-following Control of a Quadrotor using Quasi-Static Transverse Feedback Linearization
We propose a quasi-static transverse feedback linearization (QSTFL) controller for a quadrotor to follow a prescribed geometric path, rather than a time-parameterized trajectory. In contrast to existing dynamic-feedback approaches, the controller does not introduce additional controller states. The thrust input is computed algebraically from the current state, eliminating the need for thrust-derivative measurements and numerical integration. The proposed design renders the path-following manifold invariant, ensuring that trajectories initialized on the path remain on it for all future time, while simultaneously regulating tangential velocity and yaw. We establish a diffeomorphic coordinate transformation and prove local exponential stability of the path-following manifold. In addition, closed-form expressions are derived for the thrust and torque inputs. Compared with dynamic-feedback constructions, the controller requires inversion of only a $3\times 3$ decoupling matrix rather than a $4\times 4$ one, leading to a simpler control law and reduced computational complexity. Numerical simulations demonstrate the effectiveness of the proposed method. Code and animations are publicly available at \footnotesize{\texttt{\href{https://gitlab.com/a5akhtar/quasistatic-tfl-uav/}{https://gitlab.com/a5akhtar/quasistatic-tfl-uav/}}}.
comment: 6 pages, 4 figues, accepted at the European Control Conference (ECC) 2026
When Distortion Helps: Secure GNN Precoding with Nonlinear Power Amplifiers
Physical layer security (PLS) provides information-theoretic protection against eavesdropping. While existing techniques assume ideal linear transmitters, power amplifiers (PAs) in practice introduce nonlinear distortion, typically considered detrimental to signal quality. This paper demonstrates that such distortion can instead be exploited as a security asset by redirecting it toward eavesdroppers, particularly in the power-efficient PA saturation regime. To this end, we propose a graph neural network (GNN)-based precoding framework for multi-user multiple-input single-output (MISO) wiretap channels that maximizes the sum secrecy rate by exploiting PA nonlinearity. Since the resulting optimization is highly non-convex, classical methods are intractable. The GNN instead learns precoding strategies directly from legitimate users' channel data, requiring neither eavesdropper channel state information (CSI) nor dedicated artificial noise (AN) power allocation. For this, the Bussgang decomposition and a high-order polynomial PA model provide an analytical secrecy rate as the training objective. At 22 dB signal-to-noise ratio (SNR) under severe PA saturation with input back-off (IBO) $= -1$ dB, the proposed GNN achieves 39.89% and 35.26% higher sum secrecy rate over maximum ratio transmission (MRT) and zero-forcing (ZF), respectively, 17.99% over AN-aided MRT and 8.67% over AN-aided ZF, with 58.13-75.31% lower standard deviation across all baselines.
Non-intrusive nonlinear reduced-order modeling with variable projection
This work presents a method for constructing nonlinear reduced-order models from input-output time-domain data. The proposed approach, termed Mixed Interpolatory Inference with Variable Projection (MIIvp), exploits the fact that the considered class of nonlinear state-space models is linear in the output equation parameters. By applying the Variable Projection (VarPro) algorithm, the optimization is restricted to the state equation parameters alone, while the output equation parameters are recovered via linear least squares. As a consequence, the output dimension does not enter the nonlinear optimization parameter vector, making the method well suited for systems with very high-dimensional outputs, a setting where many other approaches become computationally prohibitive. Under mild assumptions, it is shown that MIIvp can recover the true model parameters up to similarity. The method is first validated on a synthetic bilinear system, where it achieves machine-precision accuracy and recovers the true eigenvalues. MIIvp is then compared with existing methods on two experimental benchmarks from the nonlinear system identification literature. These numerical experiments showcase both the validity and the limitations of the proposed approach. Finally, directions for improvements and future work are outlined.
Towards time-variant scenario reduction for energy system optimization modeling under uncertainty
Stochastic programming has become a popular tool for supporting decision-making under uncertainty in the long-term planning of energy systems. Existing scenario reduction methods, however, are naive about the long-term temporal nature of scenarios, which limits their efficiency in reducing model size. In this paper, we overcome this inefficiency by proposing a novel time-variant scenario reduction framework that explicitly allows for varying scenario aggregations over time. As a result, scenario probabilities become time-variant, enabling not only the accurate capture of scenario realizations but also their probabilities at the time steps that drive investment decisions. This substantially increases flexibility compared to traditional time-invariant methods, which we demonstrate on a two-stage stochastic generation expansion planning problem with uncertain renewable power production.
Robust Data-Driven Nash Equilibrium Seeking under Partial-Decision Information
This paper presents a data-driven framework for decentralized Nash equilibrium (NE) seeking in multi-agent systems with unknown linear dynamics subject to exogenous disturbances, operating under partial-decision information (where agents lack direct access to the decisions of all others) and equality constraints. The proposed framework integrates an NE model, a distributed communication protocol, an internal model for disturbance rejection, and a data-driven stabilization strategy. By reformulating the problem as a cooperative output regulation problem, we synthesize controllers directly from noisy input-state data via semi-definite programs (SDPs), providing formal guarantees for closed-loop stability and asymptotic convergence to the NE. The approach is further extended to a class of nonlinear systems with constant disturbances by leveraging integral control and describing nonlinearities via quadratic constraints. Numerical simulations involving unmanned aerial vehicle networks and a rotary-wing aerial vehicle formation validate the efficacy and robustness of the proposed method.
Scalable Online Flight Trajectory Optimization via Sequential Quadratic Programming for Urban Air Mobility in Ultra Low-Altitude Airspace ATC
As Urban Air Mobility (UAM) scales toward high-density operations, generating collision-free trajectories within complex 3D cityscapes is a critical safety requirement. This paper proposes a scalable Sequential Quadratic Programming (SQP) framework that integrates geometric environmental constraints, operational limits, and vehicle dynamics within a single online trajectory optimization process. Rather than precomputing obstacle-free corridors ahead of time, our method encodes obstacle avoidance as live separating-hyperplane constraints regenerated at every solver iteration, so that dense urban geometry and full-DOF vehicle dynamics are resolved jointly and online as the reference and environment evolve. A variable-scale quadtree decomposition keeps computation bounded, enabling the framework to scale to city-wide environments while preserving real-time performance for high-speed flight. We validate the framework against conventional SQP, Iterative Linear Quadratic Regulator, and Differential Dynamic Programming across flights in five real-world urban centers, attaining 100% success and clearance rates on CPU-only hardware.
comment: Accepted to AIAA DATC/IEEE Digital Avionics Systems Conference (DASC 2026)
ThermoLLM: Thermodynamics-Aware HVAC Control with Spatial-Semantic Knowledge Graph SP
Multi-zone HVAC control is a spatial decision problem in which indoor thermal evolution and control decisions depend not only on outdoor conditions and internal heat gains but also on zone layout, physical adjacency, and delayed thermal interactions across the building. Recent LLM-based HVAC controllers have shown that prompt-based control is feasible. However, these methods typically rely on task descriptions, observation values, short textual feedback, or unstructured retrieval, which limits their ability to reason about zone coupling, thermal response, and building dynamics. This paper presents a thermodynamics-aware LLM control framework for a five-zone EnergyPlus building simulation. The controller is grounded in a physics-informed spatial knowledge graph derived from Brick-style building semantics and linked with recent interaction history. At each control step, the model receives the current building state, graph-structured spatial context, and recent environment-controller history, enabling it to make decisions that reflect both building structure and short-term thermal evolution. We evaluate the framework against standard control baselines and several LLM-based alternatives. Results show that the proposed approach achieves the best overall energy-comfort trade-off and the lowest PMV violation while maintaining energy-efficient operation.
comment: 10 pages, 5 figures. Submitted to ACM SIGSPATIAL 2026
Adaptive Joint Beamforming and Fluid Antenna System Design for 6G ISAC
Fixed-Position Antennas (FPAs) are constrained by static physical topologies and struggle to adapt to rapidly varying wireless environments. By dynamically reconfiguring the antenna positions, Fluid Antenna Systems (FASs) introduce additional spatial Degrees of Freedom (DoF) for wireless optimization. This paper investigates the joint optimization of Fluid Antenna System (FAS) topology reconfiguration and active beamforming for mobile Integrated Sensing and Communication (ISAC) systems. To enable real-time decision making, an end-to-end optimization framework based on the Soft Actor-Critic (SAC) algorithm is proposed. Simulation results show that the proposed scheme achieves an online inference latency of only 4 ms. Compared to the widely used alternating optimization, it improves communication performance by 42%. Moreover, it achieves performance comparable to the SCA-SDR benchmark while requiring 57% fewer antennas, demonstrating superior hardware efficiency.
comment: 6 pages, 5 figures
Delayed Functional Observers for the Realization of Generalized Delayed Control Laws
Building on the collective advancements in the literature \cite{trinh1, trinh2, trinhnn26, trinhnam26, trinhnam1}, this paper proposes the design of delayed functional observers to asymptotically estimate a generalized delayed control law under significant input and output delays. This framework enables designers to extend the allowable bounds for input delays while ensuring that the observer-based control scheme stabilizes the system despite simultaneous mismatched input and output time-delays.
HERCULES: An Open-Source Simulation Framework for Heterogeneous Multi-Robot SLAM, Collaborative Perception, and Exploration
We present HERCULES, an open-source simulator and data-collection pipeline for heterogeneous multi-robot autonomy. Built upon the Unreal Engine 5 (UE5)-based simulators AirSim and Cosys-AirSim, HERCULES resolves key architectural limitations of prior frameworks to enable concurrent unmanned aerial and ground vehicle (UAV-UGV) operation in large-scale, photorealistic, dynamic environments. It introduces a new waypoint-tracking UGV controller that mirrors existing UAV control interfaces, and provides a shared navigation stack for mapping, traversability analysis, planning, and control across heterogeneous platforms. Expanding inherited sensor suites, it adds physics-based long-wave infrared (LWIR) cameras and configurable night-vision modes for degraded visual environments. HERCULES provides lightweight APIs, ROS 2 wrappers, and rigorous time synchronization across sensors and platforms, and brings state-of-the-art game-engine capabilities into robotics simulation, integrating intelligent agents such as pedestrians, traffic, and wildlife with high-fidelity dynamic phenomena, including fire, flooding, and crop disease spread. HERCULES runs in two modes: passively, replaying offline-designed trajectories to generate reproducible multi-modal datasets, and actively, running an online planner in closed loop from live observations. Our experiments in heterogeneous multi-robot SLAM, collaborative perception, and exploration, using both HERCULES-generated data and active closed-loop execution, demonstrate its utility for advancing heterogeneous multi-robot autonomy. We publicly release our source code, experiment code, documentation, and datasets, including a heterogeneous multi-robot SLAM benchmark collected with two UAVs and two UGVs across kilometer-scale desert, forest, and city environments, at https://lunarlab-gatech.github.io/HERCULES-website.
comment: 19 pages, 14 figures, and 12 tables
Development, Validation, and Benchmarking of a Multidisciplinary Semi-Analytical Model for Wave Energy Converters
Wave energy converters (WECs) require system-level techno-economic analysis to balance power production, cost, and survivability. Existing simulation tools are either too computationally costly for large-scale optimization or too narrow in disciplinary scope to support integrated design studies. This work presents MDOcean, a novel open-source WEC simulation framework for rapid early-stage design exploration, parametric analysis, and multidisciplinary optimization. MDOcean integrates hydrodynamics, dynamics, structures, and economics in a computationally efficient architecture based on analytical and semi-analytical methods that substantially reduce runtime while maintaining near-numerical accuracy. The framework includes an eigenfunction-based linear hydrodynamic solver, a quasi-linearized frequency-domain dynamics engine capable of modeling drag and saturation nonlinearities, a structural sizing module incorporating realistic yield, ultimate, buckling, storm, and fatigue design criteria, and a simple cost model for techno-economic assessment. Particular emphasis is placed on the linearized pseudo-spectral optimal control formulation, which extends frequency-domain constraint-handling approaches with a unified describing-function and analytical quadratically-constrained quadratic program framework. This formulation efficiently treats nonlinearities and constraints while preserving compatibility with optimization and frequency-domain analysis techniques. Validation and benchmarking demonstrate that MDOcean's 151 ms runtime is orders of magnitude faster than leading WEC simulation tools while maintaining agreement with higher-fidelity baselines to within a few percent in most cases. The framework also provides insight into limiting behaviors, scaling laws, subsystem interactions, and key tradeoffs governing WEC design and techno-economic performance.
comment: 75 pages, 45 figures. To be submitted to Applied Ocean Research journal. See code at https://github.com/symbiotic-engineering/MDOcean/tree/main and reproducibility package at https://calkit.io/symbiotic-engineering/mdocean
Distributionally Robust Joint Information and Mechanism Design for Multi-Area Power System Coordination
We study a continuous-time stochastic Stackelberg control problem in which a leader steers a system of strategic followers through two non-standard channels - the information structure and a transfer mechanism - rather than through the dynamics directly. The latent environment is a jump-diffusion; the leader commits to a Gaussian public-signaling channel whose belief consequences are tracked by a finite-dimensional projection filter (the exact filter being infinite-dimensional), together with a Groves transfer that aligns the followers' incentives. Under truthful disclosure, efficient behavior is a dominant-strategy best response, and the induced differential game admits saturated and bang-bang Nash feedback. We cast the leader's distributionally robust problem, over a relative-entropy ambiguity neighborhood, as a two-controller Isaacs equation; prove that incentive alignment collapses the bilevel Stackelberg problem to a single robust control problem with an exact first-order condition; and characterize the value function as the unique viscosity solution, with a verification theorem valid for the non-smooth bang-bang feedback and a semiconcavity result that renders the switching set Lebesgue-null. We instantiate the framework on resilient multi-area power-system coordination under extreme weather. Calibrated to the 2021 Winter Storm Uri, an Isaacs solve over ERCOT's near-islanded interconnection (a 0.82 GW tie, under 2% of peak) shows mutual aid removes about 8% of social cost, rising to roughly 30% under the FERC/DOE-recommended interregional transfer capability; a reserve-scheduling experiment shows that public disclosure lowers welfare cost by 37% under autarky and 48% under market coupling, and that information design and market coupling are complements under common (systemic) risk.
comment: 22 pages, 5 figures
A Comparative Study of Bayesian Contextual Bandits for Real-Time Warehouse Sorter Optimization
Efficient sorter diversion control of automated material handling systems (MHS) is critical for optimizing operational efficiency in large-scale warehouse environments. In this study, we use an inbound receiving sorter at a high-volume e-commerce warehouse as our primary use case, where the sorter diversion system relies on cost functions with static weight configurations that fail to adapt to highly dynamic system contexts, such as volume mode, congestion level, equipment physical status, and upstream/downstream dependencies. To address this real-time sorter diversion optimization challenge, we conducted a comparative study of three candidate hybrid machine learning frameworks: Linear Regression with Gradient Descent Optimization (LR+GDO), XGBoost with Bayesian Optimization (XGB+BO), and Bayesian Contextual Bandits (BCB). Model training and evaluation were enabled by leveraging a high-fidelity physics-aware emulator to overcome the cold-start problem and allow a safe transition from offline to online learning. We performed comprehensive evaluations including reward model predictive accuracy, contextual sensitivity, action distribution, and projected reward uplift. Our results demonstrate that while tree-based reward models offer slightly better predictive power, the BCB framework achieved overall higher performance with 2.03% reward uplift over the heuristic baseline. Furthermore, BCB exhibits several superior characteristics, such as its decisive time-optimal policy backed by Bang-Bang control theory, continuous online learning capability, strategic balance between exploration and exploitation, and significantly shorter inference latency. These results demonstrate the potential of the BCB framework for real-time control optimization in large-scale warehouse environments, motivating further investigation toward operational deployment.
comment: Accepted at 2026 IEEE International Conference on Mechatronics and Automation (IEEE ICMA 2026)
Learning the Koopman Operator using Attention Free Transformers
Learning Koopman operators with autoencoders enables linear prediction in a latent space, but long-horizon rollouts often drift off the learned manifold, leading to phase and amplitude errors on systems with switching, continuous spectra, or strong transients. We introduce two complementary components that make Koopman predictors more robust. First, we add an attention-free latent memory (AFT) block that aggregates a short window of past latents to produce a corrected latent before each Koopman update. Unlike multi-head attention, AFT operates in linear time and adds only $\approx$30k parameters ($3d^2 + T^2$, fewer than matched multi-head attention), yet captures the local temporal context needed to suppress error divergence. Second, we propose dynamic re-encoding: lightweight, online change-point triggers (EWMA, CUSUM, and sequential two-sample tests) that detect latent drift and project predictions back onto the autoencoder manifold. Across three benchmark systems -- Duffing oscillator, Repressilator, IRMA -- our model consistently reduces error accumulation compared to a Koopman autoencoder and matched-capacity multi-head attention. We also compare against GRU and Transformer autoencoders, evaluated both from initial conditions and with a 50-step context, and find that Koopman+AFT (with optional re-encoding) attains markedly lower long-horizon error while maintaining lower inference latency. We report improvements over horizons up to 1000 steps, together with ablations over trigger policies. The result is a fast, compact predictor that stays on the learned manifold over long horizons.
comment: 28 pages, 10 figures, 9 tables. Code: https://github.com/MohammedNagdi/Attended-Koopman
Flow-Corrected Thompson Sampling for Non-Stationary Contextual Bandits
We study non-stationary linear contextual bandits where the reward model drifts over time, rendering classical contextual bandit algorithms brittle because historical data becomes systematically biased. We propose Flow-Corrected Thompson Sampling (fcTS), a Bayesian method that reuses experience by transporting past rewards to the present using an explicit drift model and incorporating each transported observation with a confidence weight that reflects transport reliability. This yields a unified template that specializes in (i) linear parameter drift via online slope estimation and reward correction, (ii) periodic variation via phase-aware reuse across cycles, and (iii) recurring regime switches via changepoint detection and regime-specific posterior memory. The resulting posterior updates remain closed-form under a linear Gaussian model and can be implemented efficiently with truncated, incrementally updated sufficient statistics. Across five controlled case studies and a semi-synthetic portfolio-selection benchmark with multiple overlapping non-stationarities, fcTS outperforms standard forgetting-based baselines (discounting, sliding windows, and periodic restarts), with the largest gains in settings exhibiting recurring temporal structure. These results demonstrate that when non-stationarity is structured, correcting and reweighting historical observations can be substantially more sample-efficient than uniformly discarding them.
Welfarist Control Design -- How to fulfill the societal mandate in multi-agent control?
At the core of most socio-technical systems lies a scarce resource that is allocated among agents: highway lanes, public transit, road space, water rights, energy access, grid capacity, user attention, pollution rights, etc. With further automation of the underlying allocation processes, control engineers are increasingly tasked to make decisive assumptions regarding what society wants. In practice to date, design choices are largely driven by industry norms and conventions rather than a result of conscientiously responsible and ethical design. In this paper, we look at tools available to control engineers to design systems in a more principled manner in order to match the societal mandate. We consider three control design paradigms: online feedback optimization, control of Markov decision processes, and model predictive control. Beginning with aggregating individual agents' preferences into control design objectives, subsequently ensuring and certifying the fulfillment of those specifications, we argue that the feedback nature of control systems enables appropriate allocation of the shared resources in ways hitherto unparalleled.
A Kalman Filter-Based Tracking Loop Design for Real-Time Aerospace GNSS Applications with Minimum Pull-Out Probability
Kalman filter-based (KF-based) tracking loops are a powerful alternative to traditional phase-locked loops (PLLs) for Global Navigation Satellite Systems (GNSS) signal tracking. The primary advantage of the KF is its ability to incorporate high-fidelity models for receiver dynamics and clock errors, allowing the loop to adapt optimally to signal conditions. However, this theoretical optimality is often compromised in practice by the processing delays inherent in real-time systems with hardware correlators, which existing KF formulations typically neglect. This paper introduces a Modified Kalman filter (mKF) that overcomes this limitation specifically for hardware-based architectures. By reformulating the measurement update to be consistent with the processing delays, the proposed mKF maintains optimality in a practical implementation. We further present a systematic method for tuning both the process noise covariance matrix and the correlation time, based on an analytical expression for the pull-out probability (POP), which is validated through Monte Carlo simulation. The mKF is then validated with a GNSS signal simulator, both by post-processing baseband samples and on a real-time GPS receiver with hardware correlators. A direct equivalence between the mKF and a one-delay Digital PLL (DPLL) is established entirely in the digital domain. At equal noise bandwidth, the mKF matches the DPLL's phase error variance while achieving lower error in the higher-order states. Moreover, the mKF sustains lock at bandwidths inaccessible to the optimal one-delay DPLL under the same dynamic stress, positioning the proposed architecture as a robust and noise-efficient solution for high-dynamic aerospace GNSS applications.
comment: This is a revised version of the manuscript. V1 was previously posted on TechRxiv with DOI: 10.36227/techrxiv.176344172.21345819/v1 This work has been submitted to the IEEE for possible publication
Topological Online Learning for Displacement-based Formation Control
This paper addresses the problem of robust formation control by introducing Topological Online Learning for Displacement-based (TOLD) formation control, a real-time edge-level adaptation framework. Unlike conventional node-level robust controllers that regulate individual robot inputs without modifying the interaction topology, TOLD updates the interaction topology weights online to directly minimize formation distortion. Two strategies are proposed under the TOLD formation control framework: Online Gradient Flow (OGF) with unconstrained weights and Online Exponential Gradient Flow (OExpGF) with non-negative convex weights. Theoretical analysis establishes that, for single-integrator agents over directed graphs, OExpGF guarantees asymptotic consensus, while OGF ensures bounded formation distortion. Simulations with twelve robots under intermittent disturbances show 1.2%-33.14% median cumulative Root Mean Distortion Error reduction when augmenting TOLD with node-level controllers. Hardware experiments with Crazyflie 2.0 quadrotors demonstrate over 62% (OGF) and 31.4% (OExpGF) reduction in median formation distortion compared to fixed-weight consensus.
DISPCA : A hybrid iterative-sequential approach for the identification of errors-in-variables model of linear DAE systems
The dynamic behavior of numerous engineering processes is effectively characterized through differential-algebraic equations (DAEs), commonly referred to as descriptor systems. While substantial progress has been achieved in identifying dynamic models governed by ordinary differential equations (ODEs), limited research has addressed the identification of descriptor systems from measured data. This work presents a systematic methodology for identifying the DAE model of a linear descriptor system in discrete difference equation form under errors-in-variables (EIV) setting, where both input and output measurements are corrupted by random noise. The proposed methodology generalizes the identification framework to handle scenarios where the system contains multiple algebraic and different ordered differential relations. The key innovation involves a partial stacking procedure of lagged data matrix with a sequentially increasing lag window that identifies all the differential relations individually. This is preceded by an iterative estimation of the measurement error covariance matrix that is diagonal and heteroskedastic, under large sample conditions. The algorithm simultaneously estimates the number of differential and algebraic relations, observability indices and delay parameters of the differential equations, and all the model coefficients directly from measured data without requiring prior specification from the user. The framework addresses the increased complexity arising from multiple dynamic coupled interactions while maintaining computational tractability through systematic decomposition of the identification problem. Effectiveness of the proposed methodology is demonstrated through several simulation studies.
comment: 19 pages, 10 figures
Decentralized Coordination of Autonomous Traffic Through Advanced Air Mobility Corridors
The use of dedicated corridors for Advanced Air Mobility (AAM) traffic is one of the most commonly proposed pathways to integrating them into existing airspace operations. Most prior research has focused on the design of networks of AAM corridors and conflict resolution for aircraft within corridors. It is also generally believed that while attractive from an implementation perspective, corridor-based operations may be inefficient, especially in the absence of centralized traffic management. In this paper, we show that contrary to this belief, it is possible for autonomous aircraft to learn to self-organize into corridor flows in decentralized settings. We illustrate our approach using scenarios in which fixed-wing aircraft need to safely and efficiently traverse (1) a single corridor with metering after the exit, (2) a sequence of two consecutive corridors, and (3) a corridor that splits into two. We find that in decentralized settings with only local information, the aircraft are able to conform to the corridor boundaries more than 94% of the time and reach their goal in a relatively efficient manner. Furthermore, tactical interventions to handle violations of the separation minimum are needed only infrequently in low- and medium-density settings. However, such tactical interventions become more frequently necessary only when traffic density is high.
comment: Presented at the AIAA SciTech 2026 Forum
Rethinking the green power grid for stability, not just for climate
The 2025 Iberian blackout has renewed concerns about the resilience of power grids with high shares of renewable generation. This commentary argues that renewable generation can not only advance decarbonization but also strengthen grid stability through synthetic inertia, advanced inverter-based control, and coordinated transmission planning. Rapid advances in energy storage and power electronics make this transition increasingly viable.
comment: Code are available at our GitHub repository https://github.com/YimingSci/EUPG-Renewables
Automating the Wildfire Detection and Scheduling Pipeline with Maneuverable Earth Observation Satellites
Wildfires are becoming increasingly frequent, with potentially devastating consequences, including loss of life, infrastructure destruction, and severe environmental damage. Low-Earth-orbit satellites equipped with onboard sensors can capture critical information related to active wildfires and enable near-real-time detection through machine learning algorithms applied to the acquired data. We propose a framework that automates the complete wildfire detection and satellite scheduling pipeline, entitled the WildFire-applicable Intelligent and Responsive Ensemble for Detection and Scheduling (WildFIRE-DS). This paper develops an algorithm to realize the vision of the WildFIRE-DS as a proof of concept, integrating three key components: wildfire detection in satellite imagery, statistical updating that incorporates data from repeated flyovers, and multisatellite scheduling optimization. The algorithm enables wildfire detection using convolutional neural networks with sensor fusion techniques, incorporates subsequent flyover information via Bayesian statistics, and schedules a constellation of satellites using the state-of-the-art Reconfigurable Earth Observation Satellite Scheduling Problem. Simulated experiments conducted using real-world wildfire locations and the orbits of operational Earth observation satellites demonstrate that this autonomous detection and scheduling approach effectively enhances wildfire monitoring capabilities.
comment: 46 pages, Journal of Aerospace Information Systems (Published)
Trust-as-a-Service: Intelligent Collaboration Orchestration via Model Context Protocol-Aided Agentic AI
As future networked systems increasingly rely on collaborative task execution among distributed devices, trust becomes essential for identifying reliable collaborators whose capabilities and resources match task-specific needs. However, diverse task needs, limited task-owner knowledge, and complex inter-device relationships make it challenging to evaluate the trustworthiness of potential collaborators and to select suitable collaborators for task completion. To address these challenges, this paper proposes Trust-as-a-Service (TaaS), an intelligent collaboration orchestration paradigm that enables trust evaluation and collaborator selection to be autonomously tailored to different task needs. To realize TaaS, we develop a Model Context Protocol (MCP)-aided agentic AI framework. The central server-side agent autonomously performs trust-related operations according to task-specific needs and delivers trust assessment services to task owners through a unified interface. Meanwhile, device-side agents expose their capabilities and resources via MCP servers, allowing devices to be dynamically discovered, evaluated, engaged, and released to form task-specific collaborative units. Experimental results demonstrate that the proposed TaaS achieves 100\% collaborator selection accuracy, along with high reliability and resource-efficient task completion.
Koopman Model Dimension Reduction via Variational Bayesian Inference and Graph Search
Koopman operator recently gained increasing attention in the control systems community for its abilities to bridge linear and nonlinear systems. Data driven Koopman operator approximations have established themselves as key enablers for system identification and model predictive control. Nonetheless, such methods commonly entail a preselected definition of states in the function space leading to high dimensional, overparameterized models that may suffer from poor numerical conditioning and degraded long term prediction performance. We address this problem by proposing a hierarchical probabilistic approach for the Koopman model identification problem. In our method, elements of the model are treated as random variables and the posterior estimates are found using variational Bayesian (VB) inference updates. Our model distinguishes from others in the integration of inclusion flags. By the help of the inclusion flags, we intuitively threshold the probability of each state in the model. We then propose a graph search based algorithm to reduce the preselected states of the Koopman model. We demonstrate that the proposed reduction improves numerical conditioning and can preserve or improve prediction performance while substantially reducing the dictionary size.
comment: 23 pages, double column
A Generalized Plant Perspective on Linear-Convex Feedback Optimization
Feedback optimization is a control approach for driving a dynamical system to the solution of an optimization problem by interconnecting the plant with an algorithm. Existing stability guarantees typically rely on timescale separation, enforced by conservative gain bounds that limit transient performance and require a pre-stabilized plant. This paper revisits the robust control perspective on feedback optimization. We formulate the plant-optimizer interconnection as a generalized plant, where the cost gradients are characterized by Zames--Falb Integral Quadratic Constraints. Classical timescale-separation bounds are recovered as a special case of static multipliers, with dynamic multipliers yielding substantially tighter stability margins. The formulation also enables IQC based synthesis of dynamic output feedback controllers that jointly stabilize the plant and optimize transient performance, with possible model uncertainty absorbed into an uncertainty channel. For constrained problems, the framework extends to dynamic controllers that generalize projected gradient flows. Numerical examples illustrate the benefits and flexibility of the proposed approach.
Coupling Scenario-Based Grid Simulations with State Estimation: Measurement Requirements for Low-Voltage Networks under the German Energy Transition Pathway
Increasing penetration of electric vehicles, heat pumps, and rooftop photovoltaics is creating thermal and voltage stress in low-voltage distribution grids. This work links the German Federal Government energy transition pathway (2025-2045) with state estimation performance requirements, evaluated at five milestone years from 2025 to 2045 on two SimBench reference networks across three equipment quality levels (good, medium, poor) and three VDE Forum Netztechnik/Netzbetrieb (VDE FNN) measurement constellations that differ in the availability of transformer and feeder-level instrumentation. Within this work's analysis, congestion is caused exclusively by transformer overloading and voltage-band violations. No individual line exceeds its thermal rating (maximum: 98.6%). Equipment quality governs congestion onset for a given deployment trajectory: under good equipment, congestion remains absent through 2045, under medium equipment it emerges from 2035 (4 of 10 scenarios), under poor equipment from 2025 (9 of 10). Without transformer instrumentation, median voltage estimation errors reach 6-42% regardless of smart meter penetration. Adding a single transformer measurement reduces errors by an order of magnitude, achieving median errors of 0.5-1.4%. In urban networks, transformer-level instrumentation meets the VDE FNN voltage accuracy target (99th percentile voltage error below 2%) in all configurations. In rural networks under poor equipment, the target is approached but not met. These findings motivate prioritizing transformer instrumentation as an effective first step for grid observability and supplementing the current consumption-driven metering rollout with risk-based deployment criteria linked to local congestion exposure.
From droop to optimality: The potential of volt/var control for power distribution grid enhancement
When high amounts of active power are injected into power distribution grids, the overall power flow is limited because voltages reach their upper acceptable limits. Volt/var control aims to raise this power flow limit without physically reinforcing the grid but by controlling the voltage using reactive power. We use real consumption and generation data on a low-voltage CIGRÉ grid model and an experiment on a real distribution grid feeder to analyze how different volt/var methods can enhance the grid. We show that local droop control enhances the grid but underutilizes the reactive power resources. We discuss how this inefficiency can be partly reduced by fine-tuning the droop curves through data-driven techniques but illustrate that inherent trade-off persist for any local control method. We finally demonstrate that coordinated control methods can track the optimal solution and enhance the grid to its full potential if grid-wide communication is available. Our numerical study over a whole year of real data suggests that coordinated volt/var control can enable another 10.4% of maximum active power injections compared to droop control. In a small-scale real-life experiment, coordinated control enhanced the grid by the same amount.
comment: Published in Sustainable Energy, Grids and Networks, Vol. 46, Article 102379 (2026)
Boundary adaptive observer design for semilinear hyperbolic rolling contact ODE-PDE systems with uncertain friction
This paper presents an adaptive observer design for semilinear hyperbolic rolling contact ODE-PDE systems with uncertain friction characteristics parameterized by a matrix of unknown coefficients appearing in the nonlinear (and possibly non-smooth) PDE source terms. Under appropriate assumptions of forward completeness and boundary sensing, an adaptive observer is synthesized to simultaneously estimate the lumped and distributed states, as well as the uncertain friction parameters, using only boundary measurements. The observer combines a finite-dimensional parameter estimator with an infinite-dimensional description of the state error dynamics, and achieves exponential convergence under persistent excitation. The effectiveness of the proposed design is demonstrated in simulation by considering a relevant example borrowed from road vehicle dynamics.
comment: 11 pages, 3 figures. Under review at Automatica, 3rd review round
CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions ICRA 2026
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, (2) and safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
comment: Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026). Copyright transferred to IEEE. Sample code for the navigation example with CBF-RL reward core construction can be found at https://github.com/lzyang2000/cbf-rl-navigation-demo
An Asynchronous Two-Speed Kalman Filter for Real-Time UUV Cooperative Navigation Under Acoustic Delays
In Global Navigation Satellite System (GNSS)-denied underwater environments, individual unmanned underwater vehicles (UUVs) suffer from unbounded dead-reckoning drift, making collaborative navigation (CN) crucial for accurate state estimation. However, the severe communication delay inherent in underwater acoustic channels poses serious challenges to real-time state estimation. Traditional filters, such as Extended Kalman Filters (EKFs) or Unscented Kalman Filters (UKFs), usually block the main control loop while waiting for delayed data, or effectively discard Out-of-Sequence Measurements (OOSMs), resulting in serious drift. To address this, we propose an Asynchronous Two-Speed Kalman Filter (TSKF) enhanced by a novel projection mechanism, which we term Variational History Distillation (VHD). The proposed architecture decouples the estimation process into two parallel threads: a fast-rate thread that utilizes Gaussian Process (GP) compensated dead reckoning to guarantee high-frequency real-time control, and a slow-rate thread dedicated to processing asynchronously delayed collaborative information. By introducing a Finite-Length Circular State Buffer (FLCSB), the algorithm applies delayed measurements to their corresponding historical states, and utilizes a VHD-based projection to fast-forward the correction to the current time without computationally heavy recalculations. Simulation results demonstrate that the proposed TSKF maintains a trajectory error comparable to computationally intensive batch-optimization methods under severe delays (up to 30\,s). Executing in sub-millisecond time, it significantly outperforms standard EKF/UKF. The results demonstrate an effective control, communication, and computing (3C) co-design that significantly enhances the resilience of autonomous marine automation systems.
comment: 6 pages, 6 figures. Accepted for publication in the 2026 IEEE International Conference on Industrial Informatics (INDIN). \c{opyright} 2026 IEEE. Personal use of this material is permitted. See PDF for the full IEEE copyright notice
From Zonal to Nodal Capacity Expansion Planning: Spatial Aggregation Impacts on a Realistic Test-Case
Solving power system capacity expansion planning (CEP) problems at realistic spatial resolutions is computationally challenging. Thus, a common practice is to solve CEP over zonal models with low spatial resolution rather than over full-scale nodal power networks. Due to improvements in solving large-scale stochastic mixed integer programs, these computational limitations are becoming less relevant, and the assumption that zonal models are realistic and useful approximations of nodal CEP is worth revisiting. This work is the first to conduct a systematic computational study on the assumption that spatial aggregation can reasonably be used for ISO-scale CEP. By considering a realistic, large-scale test network based on the state of California with over 8,000 buses, we find that well-designed small spatial aggregations can yield good approximations but that coarser zonal models may result in large distortions of investment decisions, e.g., capacity under-investment of up to 41% for the lowest resolution model considered.
comment: 10 pages, 4 figures, 6 tables
From Singleton Obstacles to Clutter: Translation Invariant Compositional Avoid Sets
This paper studies obstacle avoidance under translation invariant dynamics using an avoid-side travel cost Hamilton Jacobi formulation. For running costs that are zero outside an obstacle and strictly negative inside it, we prove that the value function is non-positive everywhere, equals zero exactly outside the avoid set, and is strictly negative exactly on it. Under translation invariance, this yields a reuse principle: the value of any translated obstacle is obtained by translating a single template value function. We show that the pointwise minimum of translated template values exactly characterizes the union of the translated single-obstacle avoid sets and provides a conservative inner certificate of unavoidable collision in clutter. To reduce conservatism, we introduce a blockwise composition framework in which subsets of obstacles are merged and solved jointly. This yields a hierarchy of conservative certificates from singleton reuse to the exact clutter value, together with monotonicity under block merging and an exactness criterion based on the existence of a common clutter avoiding control. The framework is illustrated on a Dubins car example in a repeated clutter field.
Robotics
Line Drawings using LightBenders: Authoring and Illuminating
This study presents the hardware and software architecture of a transformative system for illuminating line drawings and letterforms. These mid-air illuminations are indoors and might be animated. The hardware contribution is a drone equipped with servo-actuated rod joints and a dense, addressable LED strip that enables arbitrary orientation, a LightBender. The software contributions are threefold. First, the system implements algorithms and heuristics to estimate the minimum number of LightBenders required to render a line drawing or letterform, stagger swarm formations to mitigate LightBender downwash, generate Swarm Flight and Lighting (SFL) files, and execute these files using a swarm of LightBenders to illuminate line drawings and letterforms. Second, a Blender add-on enables users to register LightBenders, author graphics and animations represented by swarms of LightBenders, and deploy the swarm for illumination through one-click functions. Third, users may import SVG files into either the Blender add-on or a standalone LB-Author tool to illuminate line drawings directly from vector graphics. We present results from an IRB-approved human subject study (n=21) to evaluate the impact of LightBender misalignment on the perceived illuminations. Obtained results demonstrate that the system's 10.1 mm maximum misalignment is perceptually acceptable across tested illuminations, with a median quality rating of 8 on a 0-10 scale.
PenduMorph: Development and Motion Analysis of Pendulum-Actuated Rolling Reconfigurable Spherical Robot with Magnetic-Coupling
This paper presents "PenduMorph", a wireless reconfigurable rolling spherical robot designed as a modular platform for enclosed locomotion and inter-module interaction in challenging environments. The proposed robot extends our previous pendulum-actuated rolling disk concept to a fully enclosed spherical architecture integrating a 2-DoF internal pendulum, onboard control, battery-powered operation, and magnetic docking. The design aims to combine independent rolling mobility with protected hardware and reliable reconfigurability. We first present the robot design and an analytical study of the magnetic coupling mechanism to evaluate retention and interaction between coupled modules. We then experimentally investigate key motion behaviors at both the single-module and dual-module levels, including independent rolling, magnetic coupling, and coordinated coupled motion. The results show that the proposed platform enables stable wireless operation and a set of distinctive reconfigurable rolling behaviors, providing a useful foundation for future modular spherical robots operating in contact-rich and demanding environments.
comment: 7 pages, Accepted to TAROS2026, Manchester
ARP: Enhancing Quantized Skill Abstractions via Visual Alignment and Iterative Refinement for Robotic Manipulation
Learning visuomotor policies for long-horizon manipulation remains a fundamental challenge. Recent skill-based imitation learning methods based on discrete quantization have shown promising results by representing complex behaviors as temporally extended skills. However, most existing approaches primarily encode action trajectories into latent skills, yielding weak visual-semantic grounding and limiting the ability to leverage visual observations for skill selection. Moreover, discrete tokenization inevitably incurs precision errors during continuous action generation. To alleviate these issues, we propose Aligned Refinement Policy (ARP), a discrete-skill framework that couples semantic grounding with execution-level refinement. Specifically, ARP introduces (i) a visual--action alignment objective that contrastively aligns visual embeddings with pre-quantized action representations in a shared latent space while preserving a state-independent skill decoder, and (ii) a lightweight Iterative Residual Head (IRH) that performs a two-step refinement to recover fine-grained control for precise execution. Extensive experiments show that ARP achieves state-of-the-art performance on the LIBERO and Meta-World benchmarks. Moreover, real-robot experiments on the Kuavo 4 Pro humanoid platform further validate its effectiveness, yielding consistent performance gains over several baselines on two challenging manipulation tasks.
Scalable Multi-Task Data Generation via Reinforcement Learning for Language-Conditioned Bimanual Dexterous Manipulation
A key bottleneck in training generalist policies for bimanual dexterous manipulation is the lack of large-scale, high-quality datasets. Synthetic data generation in simulation provides a scalable alternative to human video demonstrations by overcoming challenges such as morphology mismatch, missing physical interactions, and the generation of robot actions. However, existing approaches based on human teleoperation offer limited task diversity, as object-centric trajectory matching often neglects the feasibility of robot execution. Reinforcement learning (RL) enables broader scalability but is often constrained by handcrafted, task-specific rewards. In this work, we propose a systematic RL-based data generation pipeline that integrates generalizable reward design, effective domain randomization, and language-conditioned task annotations. This pipeline synthesizes diverse, high-quality datasets for dexterous bimanual manipulation and enables training of language-conditioned multi-task policies. Our experiments show that the generated data significantly improves generalization across three representative manipulation tasks.
BLENDS: Bayesian Learning-Enhanced Deep Smoothing for GNSS-Denied Environments
Maintaining accurate navigation during GNSS outages remains a significant challenge for autonomous systems relying on low-cost inertial sensors. While classical smoothing methods, such as the two-filter smoother and Rauch-Tung-Striebel smoother, exploit measurements collected before and after an outage, their performance remains limited by the accuracy of conventional GNSS measurements. This paper presents Bayesian learning-enhanced navigation with deep smoothing (BLENDS), a transformer-based framework that augments Bayesian smoothing with learned covariance adaptation and state correction. The proposed method preserves the statistical foundations of Bayesian estimation while leveraging data-driven learning to improve navigation accuracy. Evaluations on the quadrotor dataset with GNSS outages demonstrate that BLENDS consistently outperforms both model-based smoothers, achieving up to 25.6% improvement in the position root mean square error while also reducing estimation uncertainty. Furthermore, BLENDS learns to compensate for the systematic bias between conventional GNSS positioning and RTK ground truth, enabling accuracy beyond that achievable with conventional GNSS measurements alone. The results demonstrate the potential of learning-enhanced Bayesian smoothing for resilient and high-accuracy navigation in GNSS-challenged environments.
Self-Evolving Cognitive Framework via Causal World Modeling for Embodied Scientific Intelligence
Current embodied world models are primarily optimized for predictive objectives, limiting their ability to generalize under distribution shifts and reason systematically about unseen situations and hypothetical interventions. We argue that embodied intelligence should move beyond predictive world modeling toward self-evolving cognitive systems that continually construct and refine internal causal representations through interaction with the environment. To this end, we propose a self-evolving cognitive framework via causal world modeling for embodied scientific intelligence, which integrates three complementary components: causal world modeling, intervention-driven causal reasoning, and continual cognitive refinement. The proposed framework continuously revises and expands its internal causal world model through causal discovery, intervention-driven feedback, and counterfactual reasoning, supporting continual cognitive refinement and enabling cognition itself to evolve over time. Furthermore, we reinterpret embodied interaction not merely as a means of trajectory optimization, but as an epistemic process for causal hypothesis generation, intervention-driven experimentation, and continual knowledge acquisition. This work provides a conceptual and theoretical foundation for a transition from predictive intelligence toward epistemic intelligence, in which intelligence emerges through the continual construction, revision, and refinement of causal world models via interaction with the environment. Accordingly, an intervention-driven causal-epistemic benchmarking paradigm is suggested for evaluating self-evolving embodied scientific intelligence.
comment: 18 pages
SPiralRoll: A Novel Adjustable-Stiffness Underactuated 3-DoF Joint with Torsion Springs for Rolling Robots
Compliant mechanisms are important in robotics because they can improve adaptability, safety, and energy efficiency while reducing hardware complexity. This paper presents SPiralRoll, a novel torsion-spring-based underactuated compliant mechanism for rolling robots and compliant robotic actuation. The mechanism uses arc-distributed elastic members and two motor inputs to realize three physically observable output motions: rotational motion, radial expansion/contraction, and axial spin induced by nonlinear compliant deformation. Two configurations, namely full-arc and single-arc designs, are developed and experimentally evaluated. Beyond benchtop validation, the mechanism is integrated into a spherical rolling robot, where proof-of-concept experiments demonstrate forward rolling and turning. The results show that the full-arc design provides better structural support and smoother deformation, whereas the single-arc design yields larger deformation and stronger inertial excitation, making it more suitable for pendulum-driven rolling locomotion. Overall, SPiralRoll provides a low-cost, compact, and fully 3D-printable solution for underactuated compliant rolling robots and adaptive robotic joints.
comment: 6 pages, Accepted to TAROS2026, Manchester
Curvature-aware 3D length estimation of greenhouse cucumbers using RGB-D imaging and cubic spline arc-length integration
Commercial greenhouse cucumber production is graded by fruit length, which drives harvest scheduling, labour allocation, and logistics. Manual measurement with thread or caliper is accurate but infeasible at commercial scale. This paper presents CucumberVision, a non-contact length estimation framework using an Intel RealSense D435 RGB-D camera. A YOLO26n instance segmentation model locates cucumbers, and SAM (ViT-B backbone) refines each detection to a pixel-precise mask. Five methods are evaluated under matched conditions: (M1) a dominant-axis skeleton scan-line baseline; (M2) PCA on the bounding-box depth point cloud; (M3) SAM mask with medial-axis skeletonisation; (M4) a hybrid keypoint-guided approach using a YOLO26-pose model predicting five anatomical landmarks (KP0--KP4) with piecewise 3D arc-length; and (M5) a novel medial arc spline method fitting a cubic spline through the 3D medial axis of the SAM mask and computing arc length by trapezoidal integration -- the first such application to elongated vegetable measurement. All methods share five-frame burst depth averaging, colour-stream intrinsic alignment, and adaptive method selection with cascading fallbacks ensuring 100% coverage. A benchmark of 48 captures across seven cucumbers in three size categories (small ~8 cm, medium ~13 cm, large ~25 cm) with thread-based ground truth establishes a significant accuracy hierarchy: M1 (MAPE 9.68%) > M2 (5.31%) > M4 (5.51%) > M3 (5.82%) > M5 (4.13%). M5 significantly outperforms all competitors at Bonferroni-corrected alpha=0.0125. A secondary contribution is identifying a 12--18% length underestimation caused by using depth-stream rather than colour-stream intrinsics after rs.align(rs.stream.color) -- an under-reported error source. The complete system is released open source and runs in real time on a single consumer-grade GPU.
Do Rigid-Body Simulators Dream of Soft Robots? Learning Contact-Rich Manipulation for Tendon-Driven Continuum Robots
Learning contact-rich, whole-body manipulation for soft continuum robots is held back by the lack of simulation infrastructure that has accelerated rigid-robot manipulation. Existing soft robot simulators are physically grounded but lack the contact handling, actuation support, or learning integration needed for contact-rich manipulation; rigid-body approximations offer these capabilities but sacrifice physical grounding. We bridge this gap for tendon-driven continuum robots (TDCRs) by deriving a continuum-mechanics-informed discretization that places the soft robot natively inside MuJoCo, unifying tendon forces, body contact, and dynamics in a single physics pipeline. We validate the simulator against a Cosserat rod reference (static and dynamic) and real TDCR hardware. We then train state-based imitation learning policies via teleoperation in simulation and deploy them zero-shot to a physical 3-segment TDCR on a 7-DoF Franka arm across two contact-rich manipulation tasks. To our knowledge, this is the first demonstration of sim-to-real transfer for contact-rich manipulation with continuum robots.
comment: Project Page: https://continuumroboticslab.github.io/opencr-mujoco/
Reference-Free Assessment of Physical Consistency in World Model-based Video Generation CVPR 2026
We introduce reference-free measures for evaluating the physical consistency of generated videos, combining relative and absolute approaches to assess fidelity. Although tools like WorldGym or WorldEval enable robotic simulation via video generation, physical fidelity gaps often prevent these environments from accurately reproducing real-world task success rates of VLA models. Unlike existing evaluation methods, which require costly human voting (Elo) or unavailable ground-truth references (FVD), our approach utilizes DROID-SLAM and SEA-RAFT to quantify physical inconsistencies, motivated by WorldScore. Videos filtered using our relative consistency assessment show an improvement in task success rates of over 8%, effectively narrowing the simulation-to-reality gap. Furthermore, our absolute assessment enables spatio-temporal localization, providing visualization of when and where physical artifacts occur.
comment: Accepted to the 2nd 3D-LLM/VLA Workshop, CVPR 2026
A Taxonomy of Conceptual Alignment in Human-Robot Dialogue
Successful conversations require speakers to align on the meaning of concepts, a challenging but crucial task for human-robot interaction. Understanding the process of establishing such alignment is hindered by competing interpretations of the term and isolated, unidirectional investigations of its design space. This paper argues for a design-centric understanding of conceptual alignment as a bidirectional and co-constructive process. We introduce a taxonomy that characterizes conceptual alignment dialogues along what triggers its initiation and what level(s) of conceptual understanding it concerns. We further present a dialogue act schema as an operational tool that captures the interactional moves through which alignment is achieved. Together, these contributions provide a structured foundation for analyzing, comparing, and designing conceptual alignment in human-robot interaction.
comment: 8 pages, 2 figures. To be presented at RO-MAN 2026
Benchmarking Robot Memory Under Interference
Robots deployed in realistic settings will accumulate experience across many sessions and tasks over their deployment. The robot's tasks may often require it to remember information from multiple sessions ago, making long-context robot memory important for real-world deployments. However, most robot-memory benchmarks today are based on single episodes or a short context. To measure how current robot memory systems perform on longer sessions with more distractions, we introduce RoboMME-Interference, a cross-session benchmark built on RoboMME. For each query episode, we construct a session history using the query's relevant prior demonstration followed by a controlled number of unrelated sessions, which we provide to the VLA as memory and measure accuracy. Running RoboMME's released memory-augmented $π_{0.5}$ variants unmodified through this benchmark, we find that while perceptual memory variants improve success when given the history without any distractors, they decay strongly and steadily as unrelated sessions accumulate. With this release, we emphasize the importance of long-context memory and robustness to interference and show that current systems largely fail on such capabilities. The project page, videos, code, and data are at https://robotmemorybench.com.
comment: 7 pages, 4 figures
Multi-AUV Marine Life Tracking with Single Hydrophone Payloads via a Hidden Markov Model Equipped Particle Filter
Researchers tag and track marine animals to study migration patterns, human impacts on behavior, and behavioral shifts due to climate change. Accurate data collection often requires tagging individual animals to collect spatio-temporal state estimates of the animal's geo-position and depth. Acoustic transmitters are prominent due to their continuous communication without requiring retrieval or surfacing to collect data. These transmitters emit underwater acoustic pulses that can be detected by hydrophones. However, the frequent movement of aquatic animals results in high data loss when the animal moves out of the detection range of a stationary hydrophone. Autonomous underwater vehicle (AUV) systems offer a solution for localizing transmitters with higher resolution over longer periods of time. Such systems previously deployed have often required multiple hydrophones mounted on a large frame carried by the AUV. This increases drag, limiting the speed at which the AUV can track highly mobile animals. This work provides an alternative by equipping multiple AUVs with a single compact hydrophone payload. A particle filter algorithm equipped with a hidden Markov model (HMM) behavioral motion model fuses measurements from multiple AUVs to estimate the transmitter's position. Real-world data shows a root mean square error (RMSE) of approximately 10 meters for short-term deployments, and a larger simulated dataset shows an RMSE of approximately 15 meters for longer deployments over a larger area. The HMM fit to historical animal movement data outperforms a generic velocity motion model, and both outperform a baseline random walk motion model.
Tactile Genesis: Exploring Tactile Sensors at Scale for Learning Dexterous Tasks
Tactile sensing is critical for contact-rich dexterous manipulation, yet it remains unclear which tactile abstractions a policy needs and when richer tactile fields justify their hardware cost. This is hard to study empirically: each sensor effectively defines a new robot, and no lab can replicate the same learning experiment across all of them. We present Tactile Genesis, a GPU-parallel tactile sensor simulation platform that exposes binary contact, contact depth, per-taxel kinematic force/torque, elastomer marker displacement, geometry-aware proximity, contact audio, and a voxelized temperature field (the first of its kind in robot learning physics simulation platforms) under a common interface, with configurable placement, resolution, and a realistic noise model (drift, hysteresis, dead taxels, crosstalk). It scales past 20,000 parallel environments and 1,000 taxels on a single GPU, improving throughput by 3 to 20 times over previous tactile simulators. We train teacher-student policies on three dexterous tasks, ablating sensor type, placement, resolution, and noise, and verify transfer to the real XHand1. Proprioception alone is insufficient on every task. Sensor placement dominates sensor type: fingertip-only coverage trails whole-hand coverage by a wide margin, while adding the palm and proximal phalanges closes most of the gap to the privileged teacher. Resolution matters far less than coverage: placing 200 taxels across the whole hand suffices across tasks. We find that force/torque per taxel is consistently the most useful sensor type. These results give concrete guidance for both future tactile hardware design for improving robot hands and policy-side observation choice in dexterous manipulation. https://neuroagents-lab.github.io/2026-tactile-genesis/
comment: 24 pages, 8 figures, 12 tables
EmbodiedUS-FS: Fast Slow Intelligence for Ultrasound Robotics
Robotic ultrasound scanning in real clinical environments requires both high-level clinical workflow reasoning and low-level closed-loop execution. Physicians natural-language instructions often contain implicit anatomical targets, procedural logic, image-quality requirements, and safety constraints, while execution is affected by patient motion, contact variations, and target drift. We propose a fast and slow hierarchical embodied ultrasound system for safe and interpretable robotic ultrasound assistance. The Slow Brain performs intent parsing and stage-wise task planning with knowledge augmentation from an API and handbook corpus, and generates executable plans through task-graph construction and structured plan verification. The Fast Brain fuses multimodal feedback, including ultrasound images, robot pose and force states, and patient-motion information, to refine local actions and perform image-quality-guided recovery behaviors. The system further integrates a Safety Shield and a hierarchical escalation policy to constrain risky actions and trigger replanning or human confirmation under persistent failures or safety-bound violations. Experiments on planning evaluation, closed-loop execution under dynamic perturbations, and safety-mechanism validation demonstrate that the proposed hierarchical design improves task success rates while reducing safety violations.
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation
Real-world reinforcement learning for robotic manipulation remains challenging, and this difficulty is amplified for flow matching policies: applying policy gradient methods to these policies is fundamentally limited by the need to backpropagate through time(BPTT) along the multi-step ODE that maps noise to actions, which is computationally prohibitive and numerically fragile. We propose FlowDPG, a DDPG-style method specifically designed for flow matching policies that distills the critic gradient into the velocity field at training time, bypassing BPTT entirely. Intuitively, FlowDPG combines two complementary vectors: the demonstration-driven velocity that keeps the action feasible, and the critic-driven correction that steers it toward higher value. Our contributions are threefold: (1) a BPTT-free distillation framework that enables stable DDPG-style policy improvement on flow matching policies, (2) a formal connection between the FlowDPG update direction and vanilla Deterministic Policy Gradient via three explicit approximations, and (3) real-world validation on a long-horizon, multi-stage, dual-arm AirPods assembly task, where FlowDPG attains a 92% end-to-end success rate, substantially outperforming recent RL methods spanning value-conditioning, auxiliary-module adaptation, and adjoint-based critic-gradient approaches. Videos and more results are provided on the project page https://flowdpg.github.io.
Any-Body Guard: Universal Safeguarding for Manipulation Policies via Action Masking
Ensuring safety of learning-enabled robotic manipulation across diverse embodiments and tasks still requires significant manual engineering. Existing approaches typically rely on heuristically designed fallback controllers or complex forward invariance assessments. These methods are often too conservative for task success, too computationally expensive for real-time execution, too heuristic to provide useful safety guarantees, or too engineering-heavy to transfer between setups. In this paper, we propose a universal safeguarding approach, X-Safe, which reasons directly in the robot's configuration space to provide formal probabilistic guarantees for collision avoidance. By operating in the configuration space, our method transfers across embodiments while relying solely on an object-based, quasi-static scene representation and a forward kinematics model of the robotic manipulator. Thus, X-Safe provides useful formal safety guarantees without requiring additional data, or engineering effort for different embodiments or scenes. We demonstrate X-Safe for diverse embodiments and policies, both in simulation and on hardware. We observe less degradation in task performance compared to state-of-the-art safeguarding, no collisions on hardware experiments, and empirically corroborate our formal guarantees.
Integrated cloud-based architecture for robot-robot and human-robot collaboration using ROS 2--MQTT in Mediterranean Greenhouses
The imperative to develop more sustainable agriculture demands a transition from isolated automation toward the deployment of multi-robot systems (MRS) in agrifood environments. However, Mediterranean greenhouse settings-characterized by narrow corridors, dense biomass, and structural metallic interference-pose significant challenges for robust and scalable communication between agents. Traditional robotic frameworks, such as ROS 2, frequently encounter node discovery issues and latency spikes due to dynamic obstacles, dense foliage, and other characteristic greenhouse elements, creating a critical bottleneck for real-time coordination. This paper proposes an innovative cloud-based hybrid architecture that establishes a two-way communication bridge between ROS 2, acting as an edge computing platform, and iVeg as a Decision Support System (DSS), using MQTT and the European FIWARE platform. The proposed framework enables seamless interoperability between fleets of multiple robots in environments with communication constraints, facilitating the synchronised exchange of high-level telemetry, point cloud data and farmer identification for collaboration, amongst other critical data. The architecture was validated in a high-fidelity simulation environment and subsequently tested in a real-world greenhouse scenario, demonstrating its ability to maintain persistent connectivity and data integrity under adverse network conditions. The results indicate that the integration of MQTT effectively eliminates information silos, providing a scalable and decentralised solution for managing complex robotic missions, which are executed locally via Edge Computing. This work sets a new methodological precedent for the concept of Greenhouse Models as a Service (GMaaS), bridging the gap between low-level robotic control and high-level, cloud-based IoT decision-making.
Efficient Continuous Semantic Mapping based on Spatio-Temporal Awareness
Continuous semantic mapping allows autonomous robots to understand both the spatial structure and the semantic content of complex environments. However, most existing methods process the entire space, treat voxels as independent units, and do not keep the semantic labels consistent over time. This leads to high computational cost and reduced robustness in dynamic scenes. This paper proposes a semantic mapping method that brings spatial and temporal relationships into the semantic inference process. The method adjusts the inference range according to the local semantic uncertainty and fuses labels over time to improve map stability and computational efficiency. Experiments on the SemanticKITTI dataset show that the proposed method improves mapping accuracy by about 12% and reaches an mIoU of 54.92%, which is 13.18 percentage points higher than spatial-only mapping. These results show that spatiotemporal reasoning is effective for continuous semantic mapping in autonomous robotic systems.
Semantic-Aware Autonomous Exploration for UAVs in Unknown Indoor Environments
Autonomous exploration in unknown environments requires unmanned aerial vehicles (UAVs) to efficiently generate informative trajectories while simultaneously constructing accurate maps. Although many existing exploration methods rely on geometric information, they often lack semantic awareness, resulting in suboptimal exploration efficiency and limited environmental understanding. To address this limitation, this paper proposes a semantic-aware exploration framework that adds semantic information to a roadmap-based exploration strategy. The proposed method builds on the Dynamic Exploration Planner (DEP), which incrementally constructs a Probabilistic Roadmap (PRM), and augments this roadmap with a semantic layer. A semantic reward function is introduced to prioritize regions containing meaningful objects and structures, enabling the UAV to focus on areas with higher information value. Furthermore, the roadmap is continuously updated to support efficient frontier selection and path planning during exploration. The proposed framework is implemented in ROS Noetic and Gazebo using an RGB-D sensor for simultaneous acquisition of geometric and semantic information. Experimental results in multiple simulated environments demonstrate that the proposed approach achieves exploration coverage rates between 90% and 94% while reducing exploration time and travel distance compared with conventional geometry-based exploration methods.
Gain Tuning Is Not What You Need: Reward Gain Adaptation for Constrained Locomotion Learning
Existing robot locomotion learning techniques rely heavily on the offline selection of proper reward weighting gains and cannot guarantee constraint satisfaction (i.e., constraint violation) during training. Thus, this work aims to address both issues by proposing Reward-Oriented Gains via Embodied Regulation (ROGER), which adapts reward-weighting gains online based on penalties received throughout the embodied interaction process. The ratio between the positive reward (primary reward) and negative reward (penalty) gains is automatically reduced as the learning approaches the constraint thresholds to avoid violation. Conversely, the ratio is increased when learning is in safe states to prioritize performance. With a 60-kg quadruped robot, ROGER achieved near-zero constraint violation throughout multiple learning trials. It also achieved up to 50% more primary reward than the equivalent state-of-the-art techniques. In MuJoCo continuous locomotion benchmarks, including a single-leg hopper, ROGER exhibited comparable or up to 100% higher performance and 60% less torque usage and orientation deviation compared to those trained with the default reward function. Finally, real-world locomotion learning of a physical quadruped robot was achieved from scratch within one hour without any falls. Therefore, this work contributes to constraint-satisfying real-world continual robot locomotion learning and simplifies reward weighting gain tuning, potentially facilitating the development of physical robots and those that learn in the real world.
comment: RSS 2025, compressed version
On the Use of AI-Driven Immersive Digital Technologies for Designing and Operating UAVs
Uncrewed Aerial Vehicles (UAVs) offer agile, cost-effective, and efficient solutions for communication relay networks. However, their modeling and control are challenging, and the mismatch between simulations and actual conditions limits real-world deployment; while maintaining adequate situational awareness remains essential for safe operation. Several studies have proposed integrating the operation of UAVs with immersive digital technologies such as Digital Twin (DT) and Extended Reality (XR) to overcome these challenges. This paper provides a comprehensive overview of the latest research and developments involving immersive digital technologies for UAVs. We explore the use of Machine Learning (ML) techniques, particularly Deep Reinforcement Learning (DRL), to improve the capabilities of DT for UAV systems, and present a case study of a DT-driven DRL pipeline that couples bidirectional physical-digital synchronisation with online recursive least-squares channel calibration for UAV resource allocation. We identify and discuss key research gaps, and propose countermeasures based on Generative AI (GAI), emphasizing the significant role of AI in advancing DT technology for UAVs. Furthermore, we review and discuss how XR technology can transform UAV operations with the support of GAI, and examine its practical challenges. Finally, we propose future research directions to further develop the application of immersive digital technologies for UAV operation.
comment: 20 pages
H-RINS: Hierarchical Tightly-coupled Radar-Inertial State Estimation via Smoothing and Mapping
Millimeter-wave radar enables robust perception in visually degraded environments, yet radar-inertial estimation remains prone to drift: sparse body-frame velocity measurements weakly constrain absolute orientation, leaving IMU biases poorly observable over the short horizons of sliding-window estimators. We propose a tightly coupled, hierarchical radar-inertial factor graph that decouples estimation into a high-rate resetting graph and a persistent global graph. The resetting graph fuses IMU preintegration, radar velocities, and adaptive ZUPT to produce smooth, low-latency odometry for real-time control. The persistent graph maintains a full state (poses, velocities, and biases) via keyframe-based geometric mapping and loop closures. Fully observable biases and their exact covariances are continuously injected from the persistent graph as priors into the resetting graph, anchoring the high-rate estimator against integration drift. Extensive evaluations demonstrate high accuracy and drift-reduced estimation at faster than real-time speeds. Code and datasets will be released upon paper acceptance.
comment: 8 pages, 8 figures
Building Generalization Into Behavior Generation Via Adaptive Compositions of Regularities
Generalization in robotics requires prior knowledge about how the world is structured, yet this structure changes from one situation to the next. This paper investigates the proposition that generalization arises from adaptively composing regularities -- predictable relationships within the robot-environment system -- into situation-appropriate structures for behavior generation. We examine this proposition by analyzing the mechanism in AICON (Active InterCONnect), a framework representing regularities as interacting processes in a differentiable network, where sensory feedback realizes composition and gradient descent generates behavior. To isolate adaptive composition as the key mechanism, we study a simple simulated problem in which all relevant regularities can be identified. We expose the resulting model to a wide range of novel conditions not considered during design, and we find that it generates context-appropriate behavior in all but one case, where encoded regularities are provably insufficient. Ablations reveal that the network automatically modulates which regularities influence behavior based on their informativeness. These results suggest that adaptive composition of regularities constitutes a powerful inductive bias for building generalization into behavior generation.
comment: 10 pages, 6 figures. v2: Add reference to Sebastián et al. and flatten PDF figures for maximum compatibility
G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation
Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We introduce \ours{}, which keeps the foundation model entirely frozen and adds three lightweight trainable modules to bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head. The trainable footprint totals about 32M parameters, under 6\% of the full model, and is supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, \ours{} attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision. Code is available at https://github.com/WeiYuFei0217/G2G.
Towards Considerate Human-Robot Coexistence: A Dual-Space Framework of Robot Design and Human Perception in Healthcare
The rapid advancement of robotics is reshaping what it means for humans and robots to coexist -- through expanded capabilities, more intuitive interactions, and deeper integration into real-world workflows. Beyond sharing physical space, this coexistence is increasingly characterized by organizational embeddedness, temporal evolution, social situatedness, and open-ended uncertainty. Because such coexistence extends beyond a single encounter, understanding healthcare robots requires looking beyond initial acceptance to how stakeholders' perceptions evolve through continued engagement. Yet, prior work has largely relied on single-point snapshots of attitudes and acceptance, offering limited insight into coexistence as a long-term, dynamic process. We address these gaps through in-depth follow-up interviews with nine participants from a 14-week co-design study on healthcare robots. We identify the human perception space, which includes four interpretive dimensions (i.e., degree of decomposition, source of evidence, scope of reasoning, and temporal orientation). We enrich the conceptual framework of human-robot coexistence by conceptualizing the mutual relationship between the human perception space and the robot design space as a co-evolving loop, in which human needs, design decisions, situated interpretations, and social mediation continuously reshape one another over time. Building on this, we propose considerate human-robot coexistence, arguing that humans act not only as design contributors but also as interpreters and mediators who actively shape how robots are understood and integrated across deployment stages. Our related prior work and supplementary materials, including the interview protocol, are available at https://byc-sophie.github.io/considerate-human-robot-coexistence/
comment: This paper has been accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)
House of Dextra: Cross-embodied Co-design for Dexterous Hands
Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework and generated robot hands are open-sourced and available on our website.
comment: Code and videos: https://an-axolotl.github.io/HouseofDextra/
What Is My Robot Thinking? Design Considerations for Transparent and Trustworthy Shared Autonomy IROS 2026
Assistive robots operating under shared autonomy must balance user control with autonomous assistance. Because robot actions depend on internal intent inference that is not directly observable, mismatches between inferred and intended goals can undermine coordination and trust. We investigate how interface-level transparency, including feedback modality (visual vs. auditory) and information richness (sparse vs. rich), shapes interaction in a vision-based shared autonomy system. In a user study with N=25 participants across two assistive manipulation tasks, we evaluate how these designs influence coordination and trust. Providing feedback significantly improves intent alignment and reduces corrective intervention, indicating that making the inferred goal legible accelerates convergence in shared control. Participants preferred visual over auditory feedback, while preferences for sparse versus rich information depended on task complexity. We also found that revealing the full belief distribution did not consistently improve alignment or trust. Together, these findings indicate that effective transparency enhances coordination primarily through goal legibility, while trust depends on task-appropriate information exposure rather than maximal disclosure. Based on these results, we outline guidelines for designing transparent shared autonomy systems.
comment: 9 pages, 5 Figures, Code, and videos are available at https://sites.google.com/view/design-t2-sa/home. Accepted at IROS 2026
M2HRI: An LLM-Driven Multimodal Multi-Agent Framework for Personalized Human-Robot Interaction
Multi-robot systems hold significant promise for social environments such as homes and hospitals, yet existing multi-robot systems often treat robots as functionally interchangeable, overlooking how distinct agent identities shape user perception and how such individuality changes the coordination requirements of multi-robot interaction. To address this, we introduce M2HRI, a multimodal multi-agent framework that models each robot as an identity-bearing agent through personality and long-term memory, together with a contextualized coordination mechanism that regulates agent participation. In a controlled user study (n = 105) in a multi-agent human-robot interaction (HRI) scenario, we found that most personality contrasts were distinguishable and consistently expressed. Long-term memory improved preference awareness and interaction naturalness, while contextualized coordination improved conversational flow, response appropriateness, and overlap avoidance. Together, these findings show that agent individuality and contextualized participation coordination play complementary roles in supporting coherent and socially appropriate multi-agent HRI. Project website available at https://project-m2hri.github.io/.
Certified World Models: Predictability Across Configuration, Horizon, and Resolution
Scale buys interpolation; structure buys certifiable transfer. A world model's average error does not say whether a particular rollout can be trusted, or for how long. For equivariant latent world models we give a predictability certificate: a computable region spanning configuration, horizon, and resolution. Under exact equivariance, rollout error is invariant over the monoid generated by k primitive symmetries and is certified from the k generators (Theorem A); universal orbit-flatness over equivariant targets characterizes equivariance at the function level (Lemma 2), so an unconstrained architecture cannot certify the property by construction. Approximate orbit-transfer defects propagate by the finite-time Lyapunov spectrum (Theorem B): expanding channels give a logarithmic horizon $T_j(ε)\sim\log(1/ε)/λ_j$, neutral channels accumulate recurrent defect linearly, and contracting channels accumulate a bounded nonzero floor. Exact conserved charge values are certified to all horizons only at zero defect; with one-step defect $η$, charge-value error grows at most as $Tη$. Empirically, on a 40-dimensional learned model a $\mathbb{Z}_N$-equivariant network recovers the full Lyapunov spectrum ($R^2=0.98$-$0.99$) where dense and recurrent baselines fail. A cone/adapted-metric certificate reads an a-priori horizon off the model's own Jacobian, tight on uniformly hyperbolic dynamics and self-abstaining elsewhere; the resulting horizon improves a budgeted re-observation decision. For public non-equivariant world models the tangent spectrum gives a training-free candidate horizon, paired with a held-out divergence cross-check that abstains or corrects when the learned loop over-promises.
comment: 56 pages. v2: finite-time defect formulation, tightened certificate scope, updated figures/metadata; experiments unchanged. Code: https://github.com/TimothyWang418/se3-ejepa
Why That Robot? A Qualitative Analysis of Justification Strategies for Robot Color Selection Across Occupational Contexts
As robots increasingly enter the workforce, human-robot interaction (HRI) must address how implicit social biases influence user preferences. This paper investigates how users rationalize their selections of robots varying in skin tone and anthropomorphic features across different occupations. By qualitatively analyzing 4,146 open-ended justifications from 1,038 participants, we map the reasoning frameworks driving robot color selection across four professional contexts. We developed and validated a comprehensive, multidimensional coding scheme via human--AI consensus ($κ= 0.73$). Our results demonstrate that while utilitarian \textit{Functionalism} is the dominant justification strategy (52\%), participants systematically adapted these practical rationales that align with existing racial and occupational stereotypes. Furthermore, we reveal that bias frequently operates beneath conscious rationalization: exposure to racial stereotype primes significantly shifted participants' color choices, yet their spoken justifications remained masked by standard affective or task-related reasoning. We also found that demographic backgrounds significantly shape justification strategies, and that robot shape strongly modulates color interpretation. Specifically, as robots become highly anthropomorphic, users increasingly retreat from functional reasoning toward \textit{Machine-Centric} de-racialization. Through these empirical results, we provide design implications to help reduce the perpetuation of societal biases in future workforce robots.
Adaptive vs. Static Robot-to-Human Handover: A Study on Orientation and Approach Direction
Robot-to-human handovers often rely on static, open-loop strategies (or, at best, approaches that adapt only the position), which generally do not consider how the object will be grasped by the human, thus requiring the user to adapt. This work presents a novel adaptive framework that dynamically adjusts the object's delivery pose in real time based on the user's hand pose and the intended downstream task. By integrating AI-based hand pose estimation with smooth, kinematically constrained trajectories, the system ensures a safe approach and an optimal handover orientation. A comprehensive user study compares the proposed adaptive approach against a static baseline across multiple tasks, evaluating both subjective metrics (NASA-TLX, Human-Robot Trust Scale) and objective physiological data (blink rate measured via wearable eye-trackers). The results demonstrate that dynamic alignment significantly reduces users' cognitive workload and physiological stress, while improving their confidence in the robot's reliability. These findings highlight the potential of task- and pose-aware systems for enabling fluid and ergonomic human-robot collaboration.
comment: The paper has been accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)
From Discrete Plans to Real-World Execution: A World-Model-Driven Framework for Execution-Aware Multi-Agent Path Finding
Multi-Agent Path Finding (MAPF) studies how to coordinate multiple agents to reach their goals without collisions and underpins a range of large-scale robotic systems, including automated warehousing and manufacturing. Recent advances enable MAPF solvers to compute high-quality plans for hundreds of agents. However, these plans are generated using simplified robot models with discretized time and action spaces. When they are deployed in physical systems, heterogeneous robot dynamics, asynchronous interactions, communication delays, and other real-world factors can lead to substantial deviations from planned performance. We bridge the gap between discrete planning and real-world execution through ExecTimeNet, a learned world model of MAPF execution that predicts how a discrete MAPF solution will unfold on physical robots, mapping each discrete action to its realized execution state, including its wall-clock completion time and the kinodynamic state in which it ends. Building on this capability, we first propose REMAP, an execution-aware MAPF framework that integrates execution-time estimation into planning, guiding the search toward MAPF solutions with improved execution performance. We also introduce ESADG, a post-planning optimization procedure that optimizes the execution schedule of a given MAPF solution while preserving path feasibility. We evaluate proposed frameworks in high-fidelity simulation with up to 300 agents and on physical robots. In simulation, ExecTimeNet predicts the execution state accurately and transfers to unseen maps and agent counts. Across simulation benchmarks spanning diverse map topologies, REMAP reduces delays by up to 21% over baselines, while ESADG achieves up to 40% normalized improvement. On physical hardware, the full pipeline reduces total execution time by up to 15.3%, demonstrating effective transfer from simulation to real-world deployment.
Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation
Relational object rearrangement (ROR) tasks (e.g., insert flower to vase) require a robot to manipulate objects with precise semantic and geometric reasoning. Existing approaches either rely on pre-collected demonstrations that struggle to capture complex geometric constraints or generate goal-state observations to capture semantic and geometric knowledge, but fail to explicitly couple object transformation with action prediction, resulting in errors due to generative noise. To address these limitations, we propose Imagine2Act, a 3D imitation-learning framework that incorporates semantic and geometric constraints of objects into policy learning to tackle high-precision manipulation tasks. We first generate imagined goal images conditioned on language instructions and reconstruct corresponding 3D point clouds to provide robust semantic and geometric priors. These imagined goal point clouds serve as additional inputs to the policy model, while an object-action consistency strategy with soft pose supervision explicitly aligns predicted end-effector motion with generated object transformation. This design enables Imagine2Act to reason about semantic and geometric relationships between objects and predict accurate actions across diverse tasks. Experiments in both simulation and the real world demonstrate that Imagine2Act outperforms previous state-of-the-art policies. More visualizations can be found at https://sites.google.com/view/imagine2act.
Multiagent Systems
SHACR: A Graph-Augmented Semi-Autonomous Framework for Multi-Class Conflict Resolution in Smart Home IoT Automation
Smart home automation increasingly relies on user-defined rules across heterogeneous IoT devices. While these rules appear harmless in isolation, their concurrent execution creates hidden, cross-rule interactions via shared devices, environmental variables, and physical topology. These interactions result in unsafe, wasteful, or privacy-threatening behaviors that are completely invisible to text-only analysis. Existing conflict detectors remain siloed, catching either static syntactic conflicts or specific environment-mediated interactions without unifying the two or providing actionable repairs for non-expert users. This paper presents SHACR, a smart home conflict resolution framework that anchors Large Language Model (LLM) unpredictability by grounding its reasoning in a formal, directed knowledge graph. SHACR encodes devices, capabilities, physical states, and Trigger-Condition-Action rules as typed, traversable entities. By elevating physical cause-effect relationships to first-class graph edges, SHACR transforms conflict detection from fragile text inference into deterministic multi-hop graph traversal, unifying logical, semantic, and physical conflict classes. It drives a closed-loop Scan-Explain-Repair-Validate workflow that uses the graph to bound the LLM's action space. We evaluated SHACR on a testbed of 203 rules deployed across 70 apartments within a smart building. By holding the underlying LLM fixed and introducing SHACR's knowledge graph, classification errors drop by 36.7\%, F1 rises from 0.59 to 0.79, and few-shot calibration further lifts F1 to 0.95, whereas the same calibration barely helps a graph-free LLM. Ultimately, this work challenges the current AI paradigm, establishing that structured knowledge representation is a far more critical factor for dependable IoT automation management than prompt engineering or underlying model architecture.
GARIP: A Running-Average Moving Reference for Last-Iterate Self-Play in Two-Player Zero-Sum Games
Self-play with naive gradient ascent cycles in two-player zero-sum games: the last iterate orbits the equilibrium. Modern methods restore last-iterate convergence by regularizing toward a reference policy -- MMD a fixed one (reaching only the regularized equilibrium), R-NaD a periodic snapshot (the engine of DeepNash). We study GARIP, which anchors to the running average, and isolate what the choice of reference controls. Our central result is a mechanism: collapse tracks the peak lag of the reference, and among causal convex averages of a fixed mean lag the running average (flat profile, peak $=$ mean) uniquely minimizes that peak, while a snapshot's sawtooth has peak $= 2\times$ mean (a one-line theorem). Two consequences follow. Convergence: we prove local last-iterate convergence at constant anchor strength -- the anchor scales the base map's rotation by $1-β$, crossing the stability boundary and turning a recurrent base into a contraction (global convergence is conjectured at small $β$; we characterize a large-$β$ consensus failure). Robustness: GARIP matches R-NaD's peak performance -- on matrix games, the Coin Game, and the board games Connect Four/Othello, both moving references are far more robust than fixed-magnet and magnet-free baselines -- but is the better hyperparameter default; we report it both ways: over the full grid collapse rates are statistically indistinguishable, yet at conventional parameterizations a matched-mean-lag setting collapses in 0/40 vs 10/40 seeds (a snapshot matches it only by knowing to shorten $K$). The boundaries: an anticipatory (negative-weight) reference does better still on the stale side, and the advantage appears only where naive self-play cycles (five deep self-play loops). All experiments are pure JAX and reproducible.
comment: 14 pages, 9 figures
Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware
Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.
comment: 20 pages, 10 figures, 1 Table. This work has been submitted to the IEEE for possible publication
Disentangling Intrinsic Importance from Emergent Structure in Multi-Expert Orchestration
Multi-expert systems, where multiple Large Language Models (LLMs) collaborate to solve complex tasks, are increasingly adopted for high-performance reasoning and generation. However, the orchestration policies governing expert interaction and sequencing remain largely opaque. We introduce INFORM, an interpretability analysis that treats orchestration as an explicit, analyzable computation, enabling the decoupling of expert interaction structure, execution order, and functional attribution. We use INFORM to evaluate an orchestrator on GSM8K, HumanEval, and MMLU using a homogeneous consortium of ten instruction-tuned experts drawn from LLaMA-3.1 8B, Qwen3 8B, and DeepSeek-R1 8B, with controlled decoding-temperature variation, and a secondary heterogeneous consortium spanning 1B-7B parameter models. Across tasks, routing dominance is a poor proxy for functional necessity. We reveal a divergence between relational importance, captured by routing mass and interaction topology, and intrinsic importance, measured via gradient sensitivity: frequently selected experts often act as interaction hubs with limited influence, while sparsely routed experts can be structurally critical. Orchestration behaviors emerge asynchronously, with expert centralization preceding stable routing confidence and expert ordering remaining non-deterministic. Targeted ablations show that masking intrinsically important experts induces disproportionate collapse in interaction structure compared to masking frequent peers, confirming that INFORM exposes functional and structural dependencies beyond accuracy metrics alone. Our code is available at https://github.com/parmanu-lcs2/inform.
comment: Accepted by Transactions on Machine Learning Research (TMLR)
Distributionally Robust Markov Games with Average Reward ICML 2026
We propose and study distributionally robust Markov games (DR-MGs) with the average-reward criterion as a crucial framework for multi-agent decision-making under model mismatches and over extended horizons. Under a standard irreducible assumption, we first derive a correspondence between the optimal policies and the solutions of the robust Bellman equation, based on which we further show the existence of a stationary Nash Equilibrium (NE) of the game. We further study DR-MGs under a more general weakly communicating setting. We construct a set-valued map based on the constant-gain optimal robust Bellman operator and show that its value is a subset of the best-response policies. We further prove that this map admits a fixed point, which implies the existence of NE. We then design two algorithms, Robust Nash-Iteration and robust TD Descent, with provably convergent guarantees. Finally, we show that the NE under average-reward can be approximated by the ones for the discounted DR-MGs as the discount factor approaches one. Our studies provide a comprehensive theoretical and algorithmic foundation for decision-making in complex, uncertain, and long-running multi-player environments.
comment: Accepted by ICML 2026
Mesh Inference: A Formal Model of Collective Inference Without a Center
We present a formal model of mesh inference: how a population of independent agents, each holding private state and exchanging only admitted, typed observations, derives a conclusion none of them holds alone, with no central coordinator and no agent exposed. No agent shares weights, gradients, or hidden state, and the agents may span different teams, networks, and organizations. Motivated by the observation that asking a model is energy-minimizing inference, we model the mesh as a coupled free energy that each agent relaxes locally. We show that a single admission/emission policy governs three properties. First, mesh inference converges to a unique answer for any admission, symmetric or not, because the coupling is always an M-matrix. Second, it is identification-complete: it derives the centralized optimum exactly when the contributing views are carrier-connected. Third, it is observation-only: no node transmits its internals, and confidentiality is the dual of identification. Content-addressed lineage is the only global side-channel. In the linear-Gaussian regime every derived answer is determined, hence equal to the centralized optimum, at O(diam^2) latency, the measured price of removing the center. One such derivation is one turn of a center-free learning loop, which we formalize as architecture rather than prove. The open problem we state is when asking improves the collective rather than corrupting it: whether the non-linear closure derives an upgraded answer or a confident error. To our knowledge, this is the first formal characterization of when a center-free, observation-only mesh recovers the centralized optimum.
comment: 22 pages, 3 figures. v2: added related work (belief-sharing / Theory-of-Mind collective intelligence); reproducible accuracy figure (exact mesh = centralized identity); new duality figure (identification and confidentiality as one rank threshold) and a summary table; tightened the transient statement, identification precision condition, and headline claim
Systems and Control (EESS)
Dynamic Resilience Assessment of Power Systems With Data Center Load Events Using Physics-Informed Neural Networks
Large data center loads introduce new resilience challenges to power systems because their disconnection and staged reconnection can induce fast voltage and frequency dynamics that are not captured by static service-status or energy-based metrics. This paper proposes a utility-side, physics-informed resilience assessment framework that evaluates these events using only grid-side dynamic models and observable post-disturbance trajectories, without requiring detailed internal data center models. An unsupervised differential algebraic equation-physics informed neural network (DAE-PINN) based on an implicit backward Euler residual is developed to jointly predict dynamic and algebraic states, enabling repeated post-disturbance trajectory evaluation while enforcing network algebraic consistency. Normalized multi-phase resilience metrics are then used to quantify disturbance, degraded-state, and restoration-period impacts and to screen data center reconnection timing and load-ramping strategies under security constraints. Case studies on a modified IEEE 33-bus feeder show that the proposed DAE-PINN accurately tracks numerical DAE solutions and substantially reduces computation time in repeated restoration screening. The proposed metrics distinguish the effects of disturbance size, data center location, and reconnection strategy, revealing the trade-off between restoration speed and transient resilience loss.
Towards an FMI Layered Standard for DAE: Applications for Simulation and Optimization
The Functional Mock-up Interface (FMI) 3.0 standard for Model Exchange is restricted to hybrid ordinary differential equations, requiring any internal algebraic equations to be solved inside the Functional Mock-up Unit (FMU) before derivatives are returned to the importer. For models originating from, e.g. Modelica, this means that nonlinear algebraic equations must be solved through internal Newton iterations, which can reduce accuracy, increase computational cost, introduce hidden solver states, and cause robustness issues in downstream simulation and optimization workflows. In this article, we present a proposal for a layered standard, fmi-ls-dae, that exposes algebraic equations and their associated algebraic variables as part of a semi-explicit index-1 differential-algebraic equation. We describe the proposed extensions to the FMI XML schema and demonstrate the approach through prototype implementations: Dymola and CasADi generate FMUs that expose this semi-explicit index-1 formulation, while CasADi, FMIOPT, Simcenter Twin Activate, and MOO (the dynamic optimization tool of OpenModelica) import them for simulation and dynamic optimization. On an industrially relevant multilink suspension corner model, the proposed DAE-FMU formulation enables the optimization routine to converge on an optimal control problem on which the equivalent ODE-FMU fails to converge. We outline ongoing work towards supporting higher-index DAEs, consistent initialization, and event handling,
comment: Submitted to American Modelica & FMI Conference 2026
Generative Robust Optimisation
Classical uncertainty sets for robust optimisation impose fixed geometric shapes that cannot represent the complex dependencies present in real-world data. We propose Generative Robust Optimisation (GRO), a framework in which a deep generative model defines the uncertainty set as the image of a neural network decoder over a calibrated latent set, naturally accommodating nonlinear correlations, asymmetry, and multimodality. A five-point evaluation framework (reconstruction fidelity, distribution matching, latent regularity, robust relevance, and computational tractability) provides systematic, model-agnostic criteria for assessing any neural network-based uncertainty set. We instantiate this framework with a Wasserstein Adversarial Autoencoder employing Gaussian mixture model-guided training for latent regularity and constraint-consistency regularisation for robust relevance. Restricting the decoder to ReLU activations enables exact worst-case verification through mixed-integer programming embedding. Extensive experiments on a production planning problem across six uncertainty distributions and six generative architectures, together with a multi-period facility location study, validate the framework and demonstrate that systematic attention to all five criteria yields uncertainty sets that are simultaneously expressive, well-calibrated, and optimisation-tractable.
LAWNs Meet SWIPT: Beamforming and Power Splitting Optimization for Predictive Control
Simultaneous wireless information and power transfer (SWIPT) has emerged as a promising paradigm for enabling sustainable connectivity in battery-limited low-altitude wireless networks (LAWNs). This paper investigates a SWIPT-enabled LAWN system in which a multi-antenna base station (BS) simultaneously delivers control information and wireless energy to a fleet of uncrewed aircraft systems (UASs) via power splitting. In particular, the BS remotely guides the UASs to accurately track predefined reference trajectories toward their destinations while avoiding multiple mobile no-fly zones (NFZs). To guarantee collision-free path planning, we first construct smooth and safe reference trajectories using stream function theory. Then, a real-time optimization problem is formulated, which jointly takes into account the wireless control cost and energy sustainability by optimizing control inputs, transmit beamforming vectors, and the power splitting ratios. To address the resultant non-convex problem, a two-stage optimization framework is proposed. First, we develop a model predictive control (MPC)-based method to generate predictive control inputs. Subsequently, we derive a computationally efficient iterative algorithm to optimize the beamforming vectors and power splitting ratios by applying semidefinite relaxation (SDR) and successive convex approximation (SCA) techniques. We further prove that the SDR is tight for our formulation. Extensive numerical results demonstrate that our proposed design significantly outperforms benchmark schemes in terms of tracking accuracy and harvested energy, thereby validating its effectiveness for sustainable implementation in LAWN systems.
Physics-Informed Predictive Control for Integrated Electric-Vehicle Thermal Management: An Open, Real-Data-Anchored Benchmark
Thermal management in a battery-electric vehicle (BEV) is a coupled, vehicle-level problem: the battery pack, the passenger cabin, the heat pump, and cabin air quality compete for shared actuation and energy, yet most studies optimise a single subsystem on proprietary models, which prevents fair, reproducible comparison. We present OpenEV-ThermoSciML, an open and reproducible benchmark that couples a battery electro-thermal-aging model, a two-node cabin model, a heat-pump/HVAC model, and a CO$_2$/ventilation model under real driving cycles (EPA) and real weather (NREL TMY3, NASA POWER), scored by a multi-objective suite spanning battery health, PMV/PPD comfort, cabin air quality, and HVAC energy. The benchmark's battery thermal core is anchored and validated on real BEV battery-management-system (BMS) data; the reduced battery (two-state) and cabin (two-node) models are validated against converged higher-fidelity references and, for the cabin, independently cross-checked against EnergyPlus 25.2.0. On top of the benchmark we develop a physics-informed scientific-machine-learning (Sci-ML) surrogate -- a nominal-physics prior plus a learned residual with conservation penalties -- that is exact on conserved quantities and dominates black-box and Koopman surrogates out-of-distribution (overall rollout RMSE 0.014 vs 1.168 and 3.991). A shielded Sci-ML model-predictive controller (MPC) delivers statistically significant, all-positive improvements over a production-like rule-based controller across six scenarios -- including a real hot-day US06 trip (energy $-15\%$, comfort RMSE $-47\%$, peak CO$_2$ $-25\%$, battery thermal-gradient $-78\%$) -- and these gains transfer to an independently exported OpenModelica 8-node co-simulation plant.
comment: 7 pages, 2 figures
An Enhanced Submodule for Modular Multilevel Converter with DC Fault Ride-Through Capability
Modular multilevel converter (MMC) has been successfully applied in various power electronic systems owing to its high efficiency, scalability, and superior output performance. Although the half-bridge submodule (HBSM) is widely used in MMCs for its structural simplicity, it is incapable of handling direct-current (DC) short-circuit faults. The diode-clamp submodule (DCSM) addresses this limitation by providing DC fault ride-through capability. However, because its two identical capacitors are connected in series, the equivalent capacitance is halved. To overcome this drawback, an enhanced SM is proposed in this paper. For the same energy storage capacity, the proposed SM reduces the total required capacitance by 75% compared with the DCSM. In addition, the proposed SM requires one fewer diode than the DCSM, thereby lowering the overall MMC capital cost. The topology and the operating modes of the proposed SM are described in detail, and its functionality is experimentally validated. The results demonstrate that the proposed SM can suppress DC fault currents and restore normal operation without additional external protection devices.
comment: 4 pages, 8 figures
Stateful Pricing and Allocation for Repeated Constrained DER Coordination in Distribution Networks
Distribution networks with high penetrations of distributed energy resources (DERs) must repeatedly allocate limited network capability in two directions: under import scarcity, which flexible demand is served, and under export congestion, which generation is curtailed. Dynamic operating envelopes (DOEs) enforce hard feasibility bounds but lack intertemporal correction, while dynamic network prices (DNPs) provide an allocative signal but cannot guarantee constraint satisfaction. This paper develops a stateful cyber-physical coordination mechanism, termed an Automatic Market Maker (AMM), as an additive coordination layer for machine-to-machine DER access. The mechanism combines dual fairness states for import and export, bounded bilateral prices driven by a voltage-aware deficit signal, and feasibility-constrained matching within a two-tier MV/LV architecture. Experiments on the CSIRO MV+33LV feeder dataset compare five mechanisms and benchmark the fair-over-time DOE formulations of Moring et al. (FET, FOT, FUH). Relative to equal-allocation DOE, the AMM reduces unserved flexible demand by 76% (96.0 MWh to 23.2 MWh) with zero thermal violations and reduces export curtailment from 85.4 MWh to 64.5 MWh. Near-identical DOE and DOE-GREEDY performance confirms that heuristic choice alone does not improve repeated constrained outcomes. The AMM reaches an annual inter-feeder Jain index of 0.9998, outperforming all DOE variants from month 6 onwards. Direct benchmarking against FET/FOT/FUH shows that these mechanisms achieve higher worst-feeder equity through an explicit max-min MV objective, but operate offline over predetermined horizons and do not provide bilateral scarcity signals, real-time operation, or participant-level intertemporal correction. The two approaches address different objectives and may be combined in future work.
Enhancing Road Safety: An IoT-Based Accident Detection and Prevention Mechanism
Road traffic accidents remain a critical global crisis, consistently serving as a primary driver of preventable mortality and severe injury. These incidents are frequently precipitated by human error, including overspeeding, driving under the influence of alcohol, and cognitive fatigue. To address this urgent public safety challenge, this paper presents an intelligent, Internet of Things (IoT)-based Accident Prevention and Detection System (APDS) designed to systematically mitigate driver risk and optimize post-collision emergency responses. The proposed framework features a multi-tiered architecture capable of executing continuous real-time telemetry monitoring, proactive local alarm triggering, and automated situational intervention. Furthermore, the system integrates automated emergency communication protocols that aggregate immediate spatial coordinates via GPS and dispatch targeted alerts to medical facilities in close proximity, thereby optimizing response times and reducing accident-related fatalities.
comment: 4 pages, 4 figures, 1 table
Control-Aware Manipulation of ArduPilot via Legitimate MAVLink Commands: Simulation and Hardware Validation
This paper investigates control-aware attacks against ArduPilot-based Unmanned Aerial Vehicles (UAVs), inwhich an adversary exploits the sensitivity of flight-controller dynamics to parameter changes to cause loss of control and crashes. It describes six attacks that exploit interactions among multi-layer controllers by modifying Proportional-Integral-Derivative (PID) gains, altering Extended Kalman Filter (EKF) estimation configuration, and violating failsafe assumptions, thereby forcing ArduPilot into unsafe operating conditions. We evaluate the attacks in Software-in-the-Loop (SITL) simulation and validate them on a Pixhawk 2.4.8 hardware platform. The results show that short sequences of well-formed MAVLink messages can exploit controller sensitivity to parameter values and updates frequency, affecting controller states and degrading attitude stability, angular-rate behavior, trajectory tracking, and estimator health. We demonstrate that when multiple effects are combined, the vehicle can enter an unsafe state and crashes. These findings show that security gaps in input-parameter handling, command trust, and controller-state validation can be exploited to cause loss of control and crashes in UAVs.
Any-Body Guard: Universal Safeguarding for Manipulation Policies via Action Masking
Ensuring safety of learning-enabled robotic manipulation across diverse embodiments and tasks still requires significant manual engineering. Existing approaches typically rely on heuristically designed fallback controllers or complex forward invariance assessments. These methods are often too conservative for task success, too computationally expensive for real-time execution, too heuristic to provide useful safety guarantees, or too engineering-heavy to transfer between setups. In this paper, we propose a universal safeguarding approach, X-Safe, which reasons directly in the robot's configuration space to provide formal probabilistic guarantees for collision avoidance. By operating in the configuration space, our method transfers across embodiments while relying solely on an object-based, quasi-static scene representation and a forward kinematics model of the robotic manipulator. Thus, X-Safe provides useful formal safety guarantees without requiring additional data, or engineering effort for different embodiments or scenes. We demonstrate X-Safe for diverse embodiments and policies, both in simulation and on hardware. We observe less degradation in task performance compared to state-of-the-art safeguarding, no collisions on hardware experiments, and empirically corroborate our formal guarantees.
Active Sensing and Deferred-Decision Trajectory Optimization for Robust Target Identification
We study trajectory optimization in mobile sensing systems that must identify which member of a finite candidate set is the true target, while maintaining reachability to all potential candidate targets, under resource constraints. Deferred-Decision Trajectory Optimization (DDTO) addresses this setting by computing trajectories that reach individual targets but remain coincident for as long as possible before separating toward different targets. We propose Active-Sensing DDTO (AS-DDTO), which extends DDTO by adding a trajectory-dependent information-acquisition term to the planning objective. The resulting planner maintains reachability to candidate targets while biasing the coincident portion of the trajectories toward regions that enable earlier target identification. The framework supports Bayesian updates and conformal candidate-set updates for distance-dependent sensing. We derive a mixed-integer conic reformulation and provide guarantees on recursive feasibility, belief concentration, and fixed-time coverage for the raw conformal candidate set. Numerical simulations show improved target identification compared with standard DDTO under distance-dependent sensing uncertainty and limited sensing budget.
comment: Published in IEEE Control Systems Letters (L-CSS), 2026. 6 pages
A Geometric Solution of the Schrödinger Bridge Problem on $\mathsf{SO}(2)$ via Stochastic Optimal Control
We present a geometric coordinate-free solution to the isotropic Schrödinger bridge problem (SBP) for the kinematic equation on the Lie group $\mathsf{SO}(2)$. We consider the angular velocity of the system as the control input and assume that the given initial and terminal state probability density functions defined on $\mathsf{SO}(2)$ in our SBP are continuous and strictly positive. We solve the SBP by proving the existence and uniqueness of a solution to the so-called Schrödinger system of equations on $\mathsf{SO}(2)$, by showing that a fixed-point recursion is contractive in a complete metric space with respect to the Hilbert's projective metric. The geometric controller thus designed only uses the intrinsic geometric structure of $\mathsf{SO}(2)$ and does not embed it in the Euclidean plane to achieve the optimal density control. The numerical simulation verifies the validity of the theoretical construction of the Schrödinger bridge. The code and animations are publicly available at \texttt{\href{https://gitlab.com/a5akhtar/sbp}{https://gitlab.com/a5akhtar/sbp}}.
comment: 6 pages, 1 figure, accepted at the European Control Conference
LSTM Variants for Chaotic Dynamical Systems: An Empirical Study on the Lorenz Attractor
Forecasting chaotic dynamical systems such as the Lorenz attractor is notoriously difficult: small numerical errors are amplified exponentially over long autoregressive rollouts. We study seven recurrent and convolutional architectures for the AI-DEEDS 2026 Chaotic Systems Challenge: a vanilla LSTM, an LSTM with additive attention, a Bidirectional LSTM (BiLSTM), a BiLSTM trained with the Huber loss, a Temporal Convolutional Network (TCN), a CNN front-end followed by an LSTM, and a CNN front-end followed by a BiLSTM. All models share the same pre-processing, sequence length, and rollout procedure, isolating the contribution of each design choice. The challenge scores predictions on a 0-100 scale where higher is better. We obtain leaderboard scores between 45.72 and 58.81, with the BiLSTM trained with Huber loss being the strongest configuration. Two findings stand out: (i) adding additive attention to the unidirectional baseline degraded performance by over ten points, and (ii) prepending a CNN front-end to either an LSTM or a BiLSTM did not help and slightly hurt the score. Per-pair RMSE measurements confirm that the BiLSTM family generalizes better in the harder pairs (6-7), while the LSTM + Attention model collapses there (RMSE up to 8.94 on pair 6). We discuss why bidirectional context and a robust loss help in chaotic regimes while attention and CNN front-ends fail in this setting.
Reconstruction of chaotic systems in invariant jet space
Takens' theorem is the gold standard for attractor reconstruction from time series, but it guarantees only topological equivalence and does not preserve metric or group properties such as symmetries. We show that switching from delay-coordinate space to jet space (signal and its derivatives) allows one to exactly preserve the symmetry group of the original system. This statement is rigorously justified by a theorem on the isomorphism of Lie algebras under jet prolongation. Numerical experiments on the Lorenz and Rössler systems confirm that jet-space reconstruction preserves geometry and symmetries, whereas Takens embedding distorts them. As quantitative metrics we use a variational elastic energy functional and the correlation dimension. It is shown that jet-space reconstruction not only outperforms Takens embedding but in some cases yields more accurate estimates of invariants than projections of the original system. The proposed approach provides a coordinate-invariant criterion for the classification of strange attractors and can serve as a basis for detecting hidden attractors.
comment: 18 pages, 2 figures
Achieving $\widetilde{O}(1/ε)$ Sample Complexity for Bilinear Systems Identification under Bounded Noises
This paper studies finite-sample set-membership identification for discrete-time bilinear systems under bounded symmetric log-concave disturbances. Our analysis considers trajectory-dependent regressors and allows marginally stable dynamics with polynomial mean-square state growth. We prove that the diameter of the feasible parameter set shrinks with sample complexity $\widetilde{\mathcal O}(1/ε)$ where $ε$ is the estimation error. Simulation supports the theory and illustrates the advantage of the proposed estimator for uncertainty quantification.
comment: 14 pages, 2 figures. Accepted by IEEE Control Systems Letters
Cyber Resilience Assessment of Unbalanced Distribution System Restoration under Sparse Load Forecasting Attacks
System restoration is critical for power-system resilience, but its growing reliance on artificial intelligence (AI)-based load forecasting creates a cyber-physical vulnerability in the restoration decision loop. Manipulated forecasts can cause infeasible restoration schedules, insufficient inverter-based-resource ramping margins, and unsuccessful recovery of de-energized segments, yet the resilience of restoration processes to such attacks remains largely unexplored. This paper evaluates restoration vulnerability at the system level rather than only measuring forecasting error. A gradient-based sparse perturbation method is developed as a stress-testing tool to identify influential forecasting inputs. We further create a restoration-aware validation framework that embeds these compromised forecasts into a sequential restoration model and evaluates operational feasibility using an unbalanced three-phase optimal power flow formulation. Case studies on a modified IEEE 123-bus feeder show that sparse input perturbations can substantially increase forecasting error and make selected microgrid restoration stages infeasible. The results reveal system-level failures caused by active-power-balance infeasibility and power ramping violations, which can prevent the restoration of critical loads. These findings provide actionable insights for designing cybersecurity-aware restoration planning frameworks.
comment: 10 pages, 7 figures
Joint Visible Light and RF Backscatter Communications for Ambient IoT Network: Fundamentals, Applications, and Opportunities
The rapid growth of the Internet of Things (IoT) devices in the sixth generation (6G) wireless networks raises significant sustainability and scalability challenges due to energy consumption, deployment complexity, and environmental impact. Ambient IoT (A-IoT), leveraging ambient energy harvesting (EH) for batteryless device operation, has emerged as a promising solution to address these challenges. Among various EH and communication techniques, visible light communication (VLC) integrated with ambient backscatter communication (AmBC) offers remarkable advantages, including energy neutrality, high reliability, and enhanced security. In this article, we propose a joint VLC-AmBC architecture, emphasizing fundamental concepts, system designs, and practical implementations. We explore potential applications in environmental monitoring, healthcare, smart logistics, and secure communications. We present proof-of-concept demonstrations for three distinct types of ambient backscatter devices (AmBDs): EH-Only, VLC-Relay, and VLC-Control. Experimental results demonstrate the feasibility of implementing joint VLC-AmBC systems, highlighting their practical viability across various deployment scenarios. Finally, we outline future research directions, including integrated sensing and communication, as well as optimized energy-efficient deployment. Open issues, such as large-scale deployment challenges, are also discussed, thereby providing a clear roadmap for future developments in joint VLC-AmBC-enabled A-IoT ecosystems.
comment: 7 pages, 5 figures, 1 table
A Stable, Accurate, and Well-Conditioned Time-Domain PMCHWT Formulation
This paper introduces a new boundary element formulation for transient electromagnetic scattering by homogeneous dielectric objects, based on the time-domain PMCHWT equation. To address dense-mesh breakdown, a multiplicative Calderón preconditioner constructed from a modified static electric field integral operator is employed. Large-timestep breakdown and late-time instability are simultaneously resolved through a rescaling of the Helmholtz components using quasi-Helmholtz projectors, with temporal differentiation and integration serving as the rescaling operators. This rescaling additionally balances the loop and star components in the large-timestep regime, thereby preventing loss of accuracy in the secondary quantities caused by numerical cancellation. The resulting discrete system is solved using a marching-on-in-time scheme in conjunction with iterative solvers. Numerical experiments for simply- and multiply-connected dielectric scatterers, including highly non-smooth geometries, corroborate the stability and efficiency of the proposed approach and demonstrate its ability to produce accurate derived quantities in the large-timestep regime.
comment: 16 pages, 10 figures
Wind-Resilient Trajectory Optimization for UAV-BS Networks: TD3 for Continuous Service Availability
Unmanned aerial vehicle (UAV)-mounted base stations are highly susceptible to wind disturbances such as gusts and turbulence, which induce positional drift and degrade communication link quality, particularly in emergency scenarios. To address this challenge, we propose a DRL-based framework for wind-resilient trajectory adjustment and positioning based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. The method models wind as a stochastic kinematic perturbation, avoiding complex aerodynamic modeling, thereby enabling the TD3 agent to learn adaptive control policies that maintain optimal coverage footprints. By prioritizing user-centric performance metrics under turbulent conditions, the proposed architecture ensures continuous service availability despite external disruptions. Simulation results demonstrate that the TD3-based approach effectively compensates for wind-induced displacements and outperforms benchmark methods, including Proximal Policy Optimization (PPO), in terms of throughput stability and robustness in windy environments.
WarPGNN: A Parametric Thermal Warpage Analysis Framework with Physics-aware Graph Neural Network
With the advent of system-in-package (SiP) chiplet-based design and heterogeneous 2.5D/3D integration, thermal-induced warpage has become a critical reliability concern. While conventional numerical approaches can deliver highly accurate results, they often incur prohibitively high computational costs, limiting their scalability for complex chiplet-package systems. In this paper, we present WarPGNN, an efficient and accurate parametric thermal warpage analysis framework powered by Graph Neural Networks (GNNs). By operating directly on graphs constructed from the floorplans, WarPGNN enables fast warpage-aware floorplan exploration and exhibits strong transferability across diverse package configurations. Our method first encodes multi-die floorplans into reduced Transitive Closure Graphs (rTCGs), then a Graph Convolution Network (GCN)-based encoder extracts hierarchical structural features, followed by a U-Net inspired decoder that reconstructs warpage maps from graph feature embeddings. Furthermore, to address the long-tailed pattern of warpage data distribution, we developed a physics-informed loss and revised a message-passing encoder based on Graph Isomorphic Network (GIN) that further enhance learning performance for extreme cases and expressiveness of graph embeddings. Numerical results show that WarPGNN achieves more than 205.91x speedup compared with the 2-D efficient FEM-based method and over 119766.64x acceleration with 3-D FEM method COMSOL, respectively, while maintaining comparable accuracy at only 1.26% full-scale normalized RMSE and 2.21% warpage value error. Compared with recent DeepONet-based model, our method achieved comparable prediction accuracy and inference speedup with 3.4x lower training time. In addition, WarPGNN demonstrates remarkable transferability on unseen datasets with up to 3.69% normalized RMSE and similar runtime.
comment: Accepted to IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) 2026
A Lightweight MPC Bidding Framework for Brand Auction Ads
Brand advertising plays a critical role in building long-term consumer awareness and loyalty, making it a key objective for advertisers across digital platforms. Although real-time bidding has been extensively studied, there is limited literature on algorithms specifically tailored for brand auction ads that fully leverage their unique characteristics. In this paper, we propose a lightweight Model Predictive Control (MPC) framework designed for brand advertising campaigns, exploiting the inherent attributes of brand ads -- such as stable user engagement patterns and fast feedback loops -- to simplify modeling and improve efficiency. Our approach utilizes online isotonic regression to construct monotonic bid-to-spend and bid-to-conversion models directly from streaming data, eliminating the need for complex machine learning models. The algorithm operates fully online with low computational overhead, making it highly practical for real-world deployment. Simulation results demonstrate that our approach significantly improves spend efficiency and cost control compared to baseline strategies, providing a scalable and easily implementable solution for modern brand advertising platforms.
A Benchmark Library for Distributed Power System Analysis and Optimization
DPLib is an open-source benchmark library created to support research and development in distributed power system analysis and optimization. Unlike centralized tools such as MATPOWER and PGLib, no general purpose, reproducible data library package currently exists for distributed power system studies. DPLib, available at \href{https://github.com/LSU-RAISE-LAB/DPLib.git}{GitHub}, fills this gap by providing 40 multi-region benchmark test cases ranging from 5 buses to 20758 buses, along with a graph-based partitioning toolkit that converts MATPOWER-compatible systems into distributed regional datasets. The toolkit generates standardized \texttt{.mat}, \texttt{.csv}, and \texttt{.m} files, regional MATPOWER version 2 cases, local and global bus mappings, generator and cost assignments, explicit inter-regional tie-line records, and bus-to-region partition maps. It supports unweighted, electrically weighted, and user-defined partitions, and is compared with METIS, KaFFPa, and an IPA-inspired baseline. DPLib also provides ADMM-based distributed DC and AC OPF solvers for validation. Numerical studies report partitioning sensitivity, centralized run times, distributed OPF iterations, run times, and optimality gaps. These results establish DPLib as a reproducible data layer for distributed power system research.
From Net Load Modifiers to Firm Capacity: The Role of Distributed Energy Resources in Resource Adequacy
Distributed energy resources such as rooftop solar, batteries, demand response, and electric vehicles can support resource adequacy, but existing rules do not always specify when their performance can count as capacity. This review examines the institutional pathway through which distributed-resource capability is forecast, qualified, verified, accredited, and enforced. We organize the pathway into five stages: load forecasting, registration and classification, metering and verification, capacity accreditation, and performance obligations. We synthesize academic literature, tariffs, market manuals, and regulatory documents from California, PJM Interconnection, ISO New England, Great Britain, and Ireland. Across these resource adequacy frameworks, similar participation barriers recur despite different procurement models and regulatory structures. These barriers arise less from the technologies themselves than from cross-stage couplings in the rules that translate physical capability into counted capacity. Registration categories can lock hybrid portfolios into mismatched obligations; verification evidence can be too coarse or difficult to audit for accreditation methods; and planning forecasts can become misaligned with delivery conditions. These couplings explain why single-stage reforms often fail to expand participation. The review argues that distributed resources can count toward adequacy only when planning assumptions, qualification rules, verification evidence, accreditation methods, and enforcement obligations are designed as a coordinated pathway. This implies reforms that codify information handoffs across stages, tie accreditation to auditable performance evidence, and refresh capacity values as resource deployment changes system conditions.
Robotics
Geometric Reconstruction of Extrinsic Contact Trajectories using Tactile Sensing and Proprioception for Tool Manipulation IROS 2026
Tactile sensing enables robots to perceive rich contact information at the grasp, supporting tasks such as object recognition, in-hand pose estimation, and slip detection. However, in many tool-mediated manipulation tasks, the interaction that determines task success occurs at the tool tip, away from the tactile sensor, making direct sensing of tool-environment contact difficult, particularly when the contact moves during interaction. In this work, we reconstruct the trajectory of extrinsic tool-tip contact using tactile sensing and robot proprioception. We formulate tool-tip trajectory reconstruction as a geometric inference problem under a single-point contact assumption. Our method first estimates the global tool-tip contact location from a calibration segment designed to approximate fixed-point behavior, and then reconstructs the full trajectory by composing relative tool motion estimated from tactile marker observations under continuous contact. Across n=51 trials with multiple trajectories, tools, wrist poses, and grasp configurations, the proposed pipeline achieves a trajectory RMSE of 8.59 +/- 2.41 mm in the world frame and a shape RMSE of 5.96 +/- 1.16 mm, while operating online at 14.00 +/- 4.11 Hz. Overall, the results show that extrinsic tool-tip trajectory geometry can be recovered consistently from grasp-level tactile sensing, with trajectory shape remaining stable across variations in tools, wrist poses, and grasp configurations.
comment: Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation
Whole-body humanoid loco-manipulation requires coordinating the robot's entire kinematic chain. However, most existing systems typically decouple the upper and lower bodies into separate controllers, limiting such coordination and yielding behaviors similar to those of a wheeled dual-arm platform. In this paper, we ask what it takes to build a whole-body native vision-language-action (VLA) model that maps language and pixels directly to all of the humanoid's degrees of freedom. We conduct a systematic empirical study organized as a roadmap of one-variable-at-a-time experiments across three phases: whole-body teleoperation, VLA model design, and heterogeneous co-training. Our study yields several intriguing findings: a joint-based whole-body teleoperation interface outperforms alternatives that only partially expose the humanoid's degrees of freedom; a VLA pretrained on static and wheeled dual-arm platforms transfers surprisingly well to a humanoid's full action space; and co-training with HuMI, the humanoid analog of UMI, extends the policy to new objects and instructions without additional whole-body teleoperation on those targets. Following this roadmap yields OpenHLM, an open-source recipe for whole-body humanoid loco-manipulation. In a challenging long-horizon task that spans a wide vertical range of the humanoid, OpenHLM outperforms two state-of-the-art humanoid VLA baselines (GR00T N1.6 and $Ψ_0$) using less than half the total demonstration time. Our code, training data, and model checkpoints are available at [https://openhlm-project.github.io/].
Full Nonlinear Nonholonomic Dynamics and Motion Analysis of a 3-DoF Underactuated Spherical Rolling Robot
This paper presents a full nonlinear constrained dynamic model of MonoRollBot, a novel 3-DoF spherical rolling robot driven by a single motor, a lead-screw transmission, and a spring-coupled internal moving mass, together with motion analysis of its behavior. To the best of our knowledge, this is one of the first full nonlinear nonholonomic models reported for a mono-actuated, super-underactuated spherical rolling robot of this kind. Because rolling without slipping is nonholonomic, the dynamics are derived using the Lagrange--d'Alembert formulation, with the lead-screw relation imposed as a holonomic constraint and the rolling condition imposed in Pfaffian form. The formulation retains the complete generalized coordinates of shell translation, shell attitude, screw travel, nut rotation, and radial mass motion. Simulations and representative motion studies show qualitative agreement with prototype behavior and reveal how gravity, compliance, and inertia jointly shape the locomotion and motion capabilities of this strongly underactuated robot. The resulting model also provides a mechanically consistent basis for future state estimation and hybrid controller design for this nonholonomic mono-actuated rolling robot.
comment: 6 pages, Accepted to TAROS2026, Manchester
Zero-shot Transfer of Reinforcement Learning Control Policies for the Swing-Up and Stabilization of a Cart-Pole System
Reinforcement learning (RL) is a powerful and convenient tool to modernize controller design. In this work, we study the zero-shot transfer of RL-based control policies from simulation to hardware for cart-pole swing-up and stabilization. The two policies are trained independently, and the handoff is implemented in Simulink via switching logic. We apply a first-order action smoothing filter to prevent hardware damage from high-frequency oscillatory actuation. Pairing this bandwidth-aware filtering with sensitivity-guided domain randomization (DR) and a simple linear curriculum learning (CL) schedule, we obtain a swing-up policy that in all of our experiments injects sufficient energy for handoff into the stabilizer's region of attraction. The stabilization policy rejects disturbances within the tested range, and the swing-up policy can re-engage after larger perturbations and restores the pendulum to the inverted position.
Physics-Informed Eikonal Caging for Whole-Arm Manipulation Planning
Planning contact-rich whole-arm manipulation is challenging because interactions that involve extended robot geometry give rise to complex contact dynamics that are difficult to model accurately. This creates a need for planning principles that do not rely heavily on precise contact models. Caging offers one such geometric notion of robustness to modeling inaccuracy by restricting object escape through geometrically enclosing the object. However, existing caging formulations are difficult to incorporate into continuous optimization-based manipulation planning. We reformulate caging as a minimum-time escape problem in which the object seeks to leave an enclosing robot geometry in the shortest time. This yields a continuous escape-time field that measures the robot's enclosure quality and we show it satisfies an eikonal equation. We therefore can approximate this field using a physics-informed neural network, producing a smooth differentiable representation that can be embedded directly into manipulation planning. The resulting objective supports whole-arm manipulation planning to favor robot configurations resisting object escape. This improves the manipulation robustness to contact model mismatch, thus enabling planning with simplified contact models, including quasi-dynamic approximations and simplified object geometry. Across simulation and real-world experiments, we show improved robustness to disturbances and contact-model mismatch relative to baselines. These results suggest that geometric enclosure can serve as a practical robustness primitive for whole-arm manipulation. A supplementary video, which includes an intuitive overview of our method and experiment video results, is available on our project webpage.
comment: Under review
RoboLineage: Agent-Native Data Lifecycle Governance Across Robot Policy Iterations
We present RoboLineage, an agent-native data lifecycle governance system for robot policy iteration. Modern robot policies improve through repeated data collection, review, retraining, evaluation, and release decisions, but the evidence connecting these steps is often scattered across local tools, scripts, and expert memory. RoboLineage makes this lifecycle explicit by representing rollouts, reviews, dataset decisions, training runs, policy metadata, evaluations, deployment recommendations, and next-collection plans as typed lineage artifacts. Agents interpret embodied rollout evidence, adapt accepted data to existing training stacks, maintain data health, and summarize cross-iteration state under explicit artifact boundaries. In real-robot manipulation workflows, RoboLineage makes routine policy iteration faster and more auditable while maintaining downstream policy performance. We open source RoboLineage as a lightweight lifecycle layer for different robot embodiments and training families. Project page: https://robolineage.github.io/
Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data
Scaling dexterous manipulation requires generalization across objects, scenes, and tasks, yet existing data sources face a trade-off between scale and scene/embodiment alignment: teleoperation data is well aligned with robot deployment but expensive to collect; simulation is scalable but limited by the sim-to-real gap; and real egocentric videos scale effectively but remain misaligned with robot deployment. We propose Wh0, a framework that uses generative video world models as scalable and controllable sources of egocentric human-hand manipulation data to unlock the manipulation capabilities of pretrained dexterous VLA models. Conditioned on language, objects, and scenes, Wh0 uses a generative world model to produce WM-H, a 50k-episode dataset of egocentric human-object interaction videos. Wh0 then converts the generated videos into robot-trainable supervision through hand motion reconstruction and visual editing. Co-trained with a limited amount of real robot data, WM-H adapts pretrained VLA models to dexterous manipulation deployment. Across 18 real-world dexterous manipulation tasks, compared with a model post-trained only on robot data, Wh0 improves zero-shot success on unseen tasks from 8.3% to 38.9%. Ablation studies further show that scalable generation and scene/embodiment alignment are key drivers of performance gains. Videos and open-source code can be found on our project website: https://chenyt31.github.io/wh0.github.io/.
comment: Under review.The first three authors contributed equally to this work
Durability-Aware Multi-Objective Optimization of the Jansen Linkage: Trading Gait Quality Against Joint Wear
The Jansen linkage is a single-degree-of-freedom planar leg mechanism whose eleven "holy numbers" were evolved by Theo Jansen to optimize the foot-path gait alone, with no regard for the wear of its revolute joints. This paper introduces a durability objective into the design of the Jansen leg. A parametric forward-kinematic model (two-circle-intersection solver), an inverse-dynamic model (constraint-Jacobian / Lagrange-multiplier formulation of a seven-body, ten-joint system, independently cross-verified by a reduced-DOF energy method), and an Archard wear model are coupled to evaluate, for any set of link lengths, both gait quality and the per-cycle sliding wear at every pin. Because the wear is computed on ideal, clearance-free revolute joints, the resulting wear figures are a relative comparative ranking rather than an absolute life prediction. A bi-objective problem -- composite gait error versus total joint wear, subject to step-length, ground-clearance, duty-factor and assembly constraints -- is solved with NSGA-II. Under the adopted gait metric the classical Jansen design is Pareto-dominated: for a representative design, link-length adjustments within +/-29% simultaneously flatten the stance (-28%), smooth the stance velocity (-58%) and reduce total joint wear by ~56%. A sensitivity study shows the wear advantage is robust across a crank-speed x payload envelope (48%-56%) and identifies the link lengths that most strongly govern wear. A variance-based global (Sobol) analysis confirms that two link lengths dominate the wear variance, and a Monte-Carlo manufacturing-tolerance study shows the wear advantage degrades gracefully under realistic fabrication error. The framework provides a practical route to longer-lived walking linkages and a baseline for future wear-clearance-impact coupled studies.
comment: 15 pages, 10 figures, 7 tables
DeformX: A Versatile Co-Simulation Framework for Deformable Linear Objects IROS 2026
Deformable linear objects (DLOs) such as wires, cables, and ropes are common in robotic manipulation tasks, yet simulating them with both visual realism and physical accuracy remains challenging. Existing visual simulation methods typically rely on procedural geometric primitives that lack physically grounded deformation behavior, while physics-based approaches with robot learning support often approximate DLOs as rigid-link chains or generic soft bodies, failing to accurately capture the bending, twisting, and shear mechanics of slender elastic structures. In this work, we introduce DeformX, a co-simulation framework that integrates a dedicated Cosserat rod physics engine with NVIDIA Isaac Sim, enabling DLO simulations that are both physically faithful and visually realistic. Our Cosserat rod engine simulates the dynamics and self-collisions of DLOs, and contact interactions with arbitrary free-form meshes. To achieve high-fidelity visualization, we employ mesh skinning to map discrete rod deformations onto imported CAD models. To the best of our knowledge, DeformX is the one of the first frameworks for DLO simulation that unifies realistic visualization, principled physics, and compatibility with robot learning pipelines. We demonstrate its versatility across synthetic data generation and policy learning for DLO manipulation, and validate visual and physical fidelity through comparisons against real-world experiments. Notably, fine-tuning Segment Anything Model 3 (SAM3) on DeformX-generated data yields a 10.2% mAP@75 improvement in real-image wire segmentation, and a rope-swinging policy trained entirely in DeformX achieves a mean target-hitting error of 6.6 cm on a UR5e manipulator in real-world trials, highlighting its strong sim-to-real transfer capability.
comment: 11 pages, 11 figures, 5 tables, IROS 2026. Website: https://deformx.github.io/
KITE: Decoupling Kinematics and Interaction for Zero-Shot Cross-Embodiment Manipulation
Generalizing manipulation policies across robot embodiments remains difficult because standard policies entangle task reasoning with embodiment-specific motor control. We study zero-shot cross-embodiment manipulation, where a policy trained on source embodiments must be deployed on a structurally different target embodiment without additional task demonstrations. We introduce Kinematic Interaction Transfer across Embodiments (KITE), which decouples manipulation into embodiment-agnostic task reasoning and embodiment-specific motor control, connected through a learned latent representation of interaction intent based on contact patterns. Task reasoning is performed by a shared policy that predicts latent intents from source demonstrations, while motor control is performed by an intent-conditioned action decoder learned from each embodiment's kinematic model. With KITE, adaptation to a new embodiment requires only training a new action decoder using its kinematic model, without recollecting demonstration data. We evaluate KITE on three manipulation tasks spanning transfer between parallel grippers, dexterous hands, and composite embodiments. KITE consistently achieves zero-shot transfer to structurally different target embodiments, outperforming state-of-the-art baselines in transfer success and task-embodiment scope.
ACEsplat: Accelerated 3D Gaussian Scene Regression via RGB and Poses Only
Per-scene 3D Gaussian Splatting (3DGS) enables high-fidelity rendering, but practical robotic and AR scene capture pipelines often depend on external geometric initialization (e.g., SfM point clouds or depth estimates), which can be slow and brittle in on-site deployment. We present ACEsplat, a fast per-scene optimization framework that reconstructs 3D Gaussian representations from RGB images and camera poses only, without requiring external 3D priors (e.g., precomputed SfM models or supervised depth maps). ACEsplat uses a two-stage pipeline: (1) a self-supervised scene coordinate regression (SCR) module builds an internal geometry prior within 4--5 minutes; (2) SCR features and coordinate priors are fused by a lightweight Gaussian initialization head, followed by per-scene 3DGS optimization. On static-view rendering, ACEsplat achieves 29.11 dB PSNR on Wayspots with real-time SLAM poses and 33.20 dB on Cambridge Landmarks with SfM-refined poses. On RealEstate10K sparse-view novel view synthesis, it achieves competitive image fidelity under a challenging 2-view setting. ACEsplat completes scene-specific SCR mapping and 3DGS reconstruction within 15--25 minutes on a single GPU, making it a practical RGB+pose-only solution for rapid scene setup in robotics and mixed-reality applications.
Dynamics, stability, and energy efficiency of an energy-recycling rimless wheel with spring-clutch legs
This paper proposes an energy-recycling rimless wheel with spring-clutch legs. The proposed mechanism uses a lockable clutch to store part of the impact-induced elastic energy after foot contact and reinject it in the next gait cycle. First, we develop a hybrid dynamic model of the energy-recycling rimless wheel. Second, numerical simulations are used to examine the dynamics, local stability of periodic gaits, and the Cost of Transport (CoT) of the proposed mechanism. The simulation results show that the proposed mechanism reduces the CoT by up to 16.13% compared with a benchmark viscoelastic-legged rimless wheel with telescopic spring-damper legs. Compared with the rigid rimless wheel, the viscoelastic-legged and energy-recycling models reduce the CoT by more than 50%. The energy-recycling model also maintains locally stable periodic gaits over the tested slope and stiffness ranges. Finally, prototype experiments on an inclined plane are conducted to examine the feasibility of the proposed mechanism. The experimental results show that the proposed rimless wheel achieves passive walking on a shallow 1° slope, corresponding to a CoT of approximately 0.02. These results suggest that the proposed spring-clutch mechanism can improve the simulated walking efficiency of the energy-recycling rimless wheel, while the prototype experiments support the feasibility of passive walking with the mechanism.
comment: 14 pages, 14 figures. Equal contribution: Tongchen Lin and Yanqiu Zheng. Corresponding author: Mingyi Liu (mingyi.liu@gtiit.edu.cn)
How Should a Simulation-to-Reality Transfer Budget Be Spent?
Simulation-to-reality transfer, often called sim-to-real transfer, is a central challenge in robot learning. Yet, the tradeoff between measuring a system more accurately and training over a broader range of simulated dynamics is still poorly understood. In this work, we focused on the allocation of real-robot measurement time between system identification and domain randomization. We studied this tradeoff in a controlled sim-to-sim pendulum setting, where a hidden-parameter model stands in for the physical robot, and the experiment sweeps identification rollouts against the width of the randomization distribution. Across the reality gaps and noise levels we tested, the measurement budget did most of the work. A small number of identification rollouts closed most of the transfer gap, and once any real data was available, policies performed best when trained at the estimated parameters rather than over a widened randomization band. Broad randomization that contained the true system still did not substitute for measurement. These results hold in a benign regime where the dynamics are identifiable and only two parameters are unknown, so structural model mismatch remains the setting where randomization breadth may become more valuable. Overall, our results suggest that sim-to-real pipelines should first measure the parameters they can and reserve randomization for the uncertainty that remains.
comment: Both authors contributed equally and share first authorship
Multi-Target Maneuver Coordinations: Unlocking Coordination Opportunities in Connected Automated Driving
Maneuver coordination is a key enabler of connected and automated driving, allowing vehicles to negotiate and execute maneuvers that would otherwise be difficult, inefficient or unsafe. Existing approaches and use cases typically assume coordination with a single predefined target vehicle, which limits the number of coordination opportunities. This paper introduces a maneuver coordination approach based on multi-target selection, which allows a vehicle to identify and select among multiple potential coordination vehicles for a given maneuver. Multi-target maneuver coordination does not require modifications to the maneuver execution logic or to the underlying coordination protocol. Instead, it extends the decision-making process preceding coordination, enabling vehicles to exploit a broader set of feasible cooperative interactions. Results show that multi-target maneuver coordination significantly increases triggered and successfully executed coordinations while maintaining a low computational cost, as the proposed approach achieves these gains without requiring the analysis of a large number of potential target vehicles. These improvements preserve coordination success rates while enabling earlier maneuver initiation.
GeoFlow-SLAM++: A Robust Multi-Camera Visual-Inertial SLAM System with Relocalization
Monocular and RGB-D visual-inertial SLAM systems remain susceptible to limited field of view, sensor-specific failure modes, and unreliable cross-session relocalization. To address these issues, we present GeoFlow-SLAM++, a tightly coupled multi-camera visual-inertial SLAM system that extends GeoFlow-SLAM from a single RGB-D sensor to a calibrated multi-camera rig with a unified body-centric formulation. Within this multi-camera framework, GeoFlow-SLAM++ supports two interchangeable visual front-ends: a conventional ORB front-end and a neural network feature (NN-Feature) front-end built on SuperPoint and LightGlue. The system unifies tracking, mapping, and relocalization on a shared body state, and combines multi-camera reprojection constraints, IMU pre-integration, cross-view place recognition, and dual-stream optical flow/NN-Feature tracking for robust localization. As an optional extension, the system can further incorporate cross-view-consistent pseudo-depth predictions from RGB images as auxiliary geometric constraints. We evaluate GeoFlow-SLAM++ on EuRoC, OpenLORIS, TUM, Hilti, and a self-collected handheld multi-camera dataset. Results show that the NN-Feature front-end improves robustness in appearance-challenging scenarios, the multi-camera formulation achieves competitive localization accuracy on Hilti, and the unified cross-view relocalization design reaches LiDAR-comparable performance on the handheld dataset.
comment: 10 pages
A Multimodal Tiltwing Framework for Bioinspired Aerial Robots
Tailless flapping-wing micro-aerial vehicles (FWMAVs) mimic the impressive flight performance of hummingbirds, utilising unsteady aerodynamic effects. However, existing designs are still limited and purpose-built with a restricted flight envelope and poor endurance. We therefore propose an adaptable tiltwing framework enabling bioinspired aerial robots to switch between hovering flight, high-speed directional flight, and energy-efficient gliding flight. The proposed framework utilises thrust vectoring with a wide actuation range via two fully independent propulsion units, each flapping a single wing, for effective control and enhanced manoeuvrability. For this, we developed a hybrid Scotch-yoke-based flapping mechanism that ensures a symmetric motion profile with a modular design guaranteeing an arbitrarily wide flapping angle to exploit the lift-enhancing clap-and-fling effect. Additionally, we implemented a passive wing-rotation mechanism, which, in combination with our dual-wing thrust-vectoring approach, allows unprecedented wing-design freedom, unlocking potential for precise optimisation. A contactless leading-edge tracking sensor provides accurate feedback on the wing's orientation and, in the gliding mode, enables dihedral-angle control, augmenting the active wing-pitch control. Extensive testing of a propulsion unit was conducted with a six-axis force/torque sensor, demonstrating the flapping mechanism's performance while optimising transmission efficiency and the passive wing-pitch mechanism. At full throttle, the average lift force generated by a single wing, flapping with a 188° amplitude, was 21.1 gf for a small 3.1 g 1S BLDC motor. Additional tests covering the full range of the wide-angle tilting capability showed an effective thrust-vectoring control architecture with a linear and symmetric response curve of the moments generated.
comment: 18 pages, 23 figures
Deep RL- Tuned Mo del-Free Adaptive Control for Lower-Limb Exoskeletons During Sit-to-Stand Transitions
Sit-to-stand (STS) transitions impose significant joint-loading demands on elderly individuals, making them a primary target for lower-limb exoskeleton assistance. However, accurate trajectory tracking during STS is challenging due to complex, time-varying human exoskeleton interaction dynamics and inter-subject variability that render model-based control approaches difficult to apply in practice. This paper presents an intelligent model free adaptive backstepping control strategy for a bilateral lower-limb exoskeleton during STS motion. The proposed controller design uses an ultra-local second-order model to avoid explicit system identification, while a Gaussian radial basis function (RBF) neural network estimates the unknown lumped dynamics online. To further improve phase-aware tracking performance, a Twin Delayed Deep Deterministic Policy Gradient (TD3) reinforcement learning agent is integrated as a supervisory gain scheduler that adaptively adjusts controller gains across the distinct phases of STS motion. The proposed controller is evaluated through co-simulation in MATLAB/Simulink and Simscape Multibody using OpenSim-derived reference trajectories and benchmarked against state-of-the-art controllers. Results demonstrate that the proposed controller achieves the lowest average RMSE of 0.078 degree across all joints, representing improvements of 60.2%, 54.4%, 48.7%, and 42.6% over proportional integral derivative (PID), model-free adaptive control (MFAC), linear quadratic regulator (LQR), and sliding-mode control (SMC), respectively. TD3 integration further reduces tracking error by 35%, 33%, and 79% at the hip, knee, and ankle joints compared to the standalone RBF-MFAC baseline. These results demonstrate the effectiveness and robustness of the proposed controller design for assistive exoskeleton control during STS transitions.
RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation
Reinforcement learning for robot manipulation is often bottlenecked by reward design, especially in long-horizon tasks: sparse success rewards provide weak supervision, while hand-crafted dense rewards are tedious to design and generalize poorly across tasks. Progress-based reward models offer a promising alternative by estimating how far an observation has advanced toward task completion, but existing approaches often require task-specific demonstrations or progress labels, and can assign high rewards to visually plausible but physically incorrect states. We introduce the Reference-Anchored Reward Model (RARM), a lightweight visual comparator that converts a single successful demonstration into a dense, progress-aware reward. RARM is trained once on general-purpose videos with a contrastive temporal objective, requiring no robot-specific data, task-specific reward labels, or per-task reward engineering. At deployment, RARM matches rollout clips to reference clips and rewards only confident forward progress, suppressing uncertain matches that may otherwise produce false-positive rewards. Across 9 simulated manipulation tasks from LIBERO and MetaWorld and 4 real-world tasks, RARM achieves the best overall success rates in subsequent RL training, with particularly large gains on long-horizon tasks such as cloth folding, where unreliable progress estimates are especially harmful.
CoRDE: Concept-Prior Routed Diffusion Experts for Structural Generalization in Robot Manipulation IROS
Diffusion models excel at capturing multi-modal action distributions in robot imitation learning. However, in multi-task and long-horizon scenarios, monolithic architectures lack structural generalization capabilities, suffering from gradient conflicts between distinct semantic sub-stages. While pure data-driven Mixture-of-Experts (MoE) methods introduce labor division, they frequently trigger routing collapse, and instantiating full-scale experts causes parameter explosion and high expansion costs. To address these issues, we propose Concept-prior Routed Diffusion Experts (CoRDE), a structure-guided variational distillation framework. CoRDE extracts semantic distributions from a frozen concept encoder to guide the variational posterior responsibility via a learnable soft mapping matrix. This mechanism introduces an entropy-controlled responsibility inference process that encourages confident routing under reliable semantic predictions while preserving the stochastic diffusion term for behavioral diversity. To overcome parameter inflation, CoRDE employs a parameter-efficient expert pool using Low-Rank Adaptation (LoRA) on a shared frozen backbone. Theoretical analysis shows that the mixture score discrepancy is bounded by responsibility-weighted local expert errors, supporting high-fidelity generation under low-rank expert adaptation. Empirical evaluations confirm that, compared to existing baselines, CoRDE systematically reduces routing collapse, forming robust, semantically aligned expert allocations while achieving superior action quality and incremental learning efficiency.
comment: 8 pages, 3 figures, Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Predictive Gaze Is Preserved but Reorganized toward Monitoring during Robot-Mediated Manipulation
Goal-directed eye movements are a fundamental component of visuomotor control, enabling humans to anticipate and guide their actions. Whether this anticipatory and task-driven behavior is preserved when actions are executed through a robot rather than through one's own body remains unclear. Here we address this question by investigating gaze behavior during goal-directed telemanipulation to determine how visuomotor control adapts to altered embodiment. Our findings show that gaze remains strongly aligned with task goals, preserving its predictive role even during robot-mediated manipulation. At the same time, teleoperation systematically redistributes visual attention toward the robotic end-effector and manipulated objects, increasing online monitoring. These findings show that predictive gaze is not lost under altered embodiment, but reorganized in response to changes in sensory feedback and control demands. More broadly, they reveal the flexibility of the human visuomotor system when the natural sensorimotor coupling is disrupted and identify gaze as an informative signal for inferring action intentions in human-robot interaction.
SurGE: Surrogate Gradient-guided Evolution for Co-design of Legged Robots with Parallel Elasticity IROS 2026
Co-design of legged robots with elastic elements is challenging due to the non-differentiability of contact dynamics and mechanism engagement. This paper presents SurGE, a framework that computes surrogate gradients of the design objective through a differentiable pipeline consisting of a kinodynamic single-rigid-body (Kino-SRB) model and a design-aware control policy, and injects them into CMA-ES via mean shift with cosine-annealed step decay. On a 4-DOF design space of a hopping robot with unidirectional parallel spring, SurGE achieves 6 times lower cross-seed standard deviation and 18% tighter population concentration compared to vanilla CMA-ES, while matching or improving the best objective. Hardware experiments on a 2D design subspace show that, starting from a hand-tuned initial design, SurGE reduces the design objective by 37.65% on hardware, with the improvement trend identified in simulation transferring consistently to the physical system. SurGE provides the potential to accelerate non-differentiable co-design problems in legged robots via surrogate model gradients.
comment: 8 pages, 7 figures. Accepted for publication at IROS 2026. Website at https://arcad-lab-um.github.io/surge-codesign/
VLA Knows Its Limits: Adaptive Execution Horizons for Robot Policies ECCV 2026
Action chunking has recently emerged as a standard practice in flow-based Vision-Language-Action (VLA) models. However, the effect and choice of the execution horizon - the number of actions to be executed from each predicted chunk - remains underexplored. In this work, we first show that varying the execution horizon leads to substantial performance deviations, with performance initially improving and then declining as the horizon increases. To uncover the reasons, we analyze the cross- and self-attention weights in flow-based VLAs and reveal two key phenomena: (i) intra-chunk actions attend invariantly to vision-language tokens, limiting adaptability to environmental changes; and (ii) the initial and terminal action tokens serve as stable anchors, forming latent centers around which intermediate actions are organized. Motivated by these insights, we interpret action self-attention weights as a proxy for the model's predictive limit and propose AutoHorizon, the first test-time method that dynamically estimates the execution horizon for each predicted action chunk to adapt to changing perceptual conditions. Across simulated and real-world robotic manipulation tasks, AutoHorizon is performant, incurs negligible computational overhead, and generalizes across diverse tasks and flow-based models.
comment: ECCV 2026. Project page at https://hatchetproject.github.io/autohorizon/
IMPACT: An Implicit Active-Set Augmented Lagrangian for Fast Contact-Implicit Trajectory Optimization
Contact-implicit trajectory optimization (CITO) has attracted growing attention as a unified framework for planning and control in contact-rich robotic tasks. Recent approaches have demonstrated promising results in manipulation and locomotion without requiring a prescribed contact-mode schedule. It is well known that the underlying mathematical programs with complementarity constraints (MPCCs) remain numerically ill-conditioned, and systematic, scalable solution strategies for CITO remain an active area of research. More efficient and principled solvers that can handle contact constraints are therefore essential to broaden the applicability of CITO. In this work, we develop an augmented-Lagrangian approach to CITO for solving MPCC-based CITO with stationarity guarantees. The method can be interpreted as identifying the implicit contact-mode branches on the fly during the trajectory optimization (TO) iterations; we call this approach IMPACT (IMPlicit contact ACtive-set Trajectory optimization). We provide an efficient C++ implementation tailored to trajectory-optimization workloads and evaluate it on the open-source CITO and contact-implicit model predictive control (CI-MPC) benchmarks. On CITO, IMPACT achieves 2.9x-70x speedups over strong baselines (geometric mean 13.8x). On CI-MPC, we show improved control quality for contact-rich trajectories on dexterous manipulation tasks in simulation. Finally, we demonstrate the proposed method on real robotic hardware on a T-shaped object pushing task.
comment: Accepted to Robotics: Science and Systems (RSS), 2026
Iterative Convex Optimization with Control Barrier Functions for Obstacle Avoidance among Polytopes
Obstacle avoidance of polytopic obstacles by polytopic robots is a challenging problem in optimization-based control and trajectory planning. Many existing methods rely on smooth geometric approximations, such as hyperspheres or ellipsoids, which allow differentiable distance expressions but distort the true geometry and restrict the feasible set. Other approaches integrate exact polytope distances into nonlinear model predictive control (MPC), resulting in nonconvex programs that limit real-time performance. In this paper, we construct linear discrete-time control barrier function (DCBF) constraints by deriving supporting hyperplanes from exact closest-point computations between convex polytopes. We then propose a novel iterative convex MPC-DCBF framework, where local linearization of system dynamics and robot geometry ensures convexity of the finite-horizon optimization at each iteration. The resulting formulation reduces computational complexity and enables fast online implementation for safety-critical control and trajectory planning of general nonlinear dynamics. The framework extends to multi-robot and three-dimensional environments. Numerical experiments demonstrate collision-free navigation in cluttered maze scenarios with millisecond-level solve times.
comment: 8 pages, 4 figures
WorkBenchMark: A LEGO-Based Assembly Benchmark with an Assembly-by-Disassembly Baseline for the Smart Manufacturing League
We introduceWorkBenchMark, a LEGO Duplo-based robotic assembly benchmark motivated by the RoboCup Smart Manufacturing League. Robotic assembly couples low-level manipulation with task-level symbolic reasoning under physical constraints, a combination that current end-to-end learning methods do not yet solve reliably. The benchmark provides 400 tasks across four complexity tiers. We provide an open-vocabulary perception, Assembly-by-Disassembly baseline solution. Our planning-based pipeline outperforms a modern vision-language-action approach across all tiers. The benchmark, simulation environment, and baseline implementation will be released openly to support the broader robotic assembly community.
DASIP: Dynamic Test-Time Compute Scaling for Robot Control with Stochastic Interpolant Policies
Diffusion- and flow-based policies deliver state-of-the-art performance on long-horizon robotic manipulation and imitation learning tasks. However, these controllers employ a fixed inference budget at every control step, regardless of task complexity, leading to computational inefficiency for simple subtasks while potentially underperforming on challenging ones. To address these issues, we introduce Difficulty-Aware Stochastic Interpolant Policy (DA-SIP), a framework that enables robotic controllers to adaptively adjust their integration horizon in real time based on task difficulty. Our approach employs a difficulty classifier that analyzes observations to dynamically select the step budget, the optimal solver variant, and ODE/SDE integration at each control cycle. DA-SIP builds upon the stochastic interpolant formulation to provide a unified framework that unlocks diverse training and inference configurations for diffusion- and flow-based policies. Through comprehensive benchmarks across diverse manipulation tasks, DA-SIP achieves 2.6-4.4x reduction in total computation time while maintaining task success rates comparable to fixed maximum-computation baselines. By implementing adaptive computation within this framework, DA-SIP transforms generative robot controllers into efficient, task-aware systems that intelligently allocate inference resources where they provide the greatest benefit.
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation
World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
Towards a Novel Wearable Robotic Vest for Hemorrhage Suppression
This paper introduces a novel robotic system designed to manage severe bleeding in emergency scenarios, including unique environments like space stations. The robot features a shape-adjustable "ring mechanism", transitioning from a circular to an elliptical configuration to adjust wound coverage across various anatomical regions. We developed various arms for this ring mechanism with varying flexibilities to improve adaptability when applied to non-extremities of the body (abdomen, back, neck, etc.). To apply equal and constant pressure across the wound, we developed an inflatable ring and airbag balloon that are compatible with this shape-changing ring mechanism. A series of experiments focused on evaluating various ring arm configurations to characterize their bending stiffness. Subsequent experiments measured the force exerted by the airbag balloon system using a digital scale. Despite its promising performance, certain limitations related to coverage area are identified. The shape-changing effect of the device is limited to scenarios involving partially inflated or deflated airbag balloons, and cannot fully conform to complex anatomical regions. Finally, the device was tested on casualty simulation kits, where it successfully demonstrated its ability to control simulated bleeding.
Closed-Loop Verbal Reinforcement Learning for Task-Level Robotic Planning
We propose a new Verbal Reinforcement Learning (VRL) framework for interpretable task-level planning in mobile robotic systems operating under execution uncertainty. The framework follows a closed-loop architecture that enables iterative policy improvement through interaction with the physical environment. In our framework, executable Behavior Trees are repeatedly refined by a Large Language Model actor using structured natural-language feedback produced by a Vision-Language Model critic that observes the physical robot and execution traces. Unlike conventional reinforcement learning, policy updates in VRL occur directly at the symbolic planning level, without gradient-based optimization. This enables transparent reasoning, explicit causal feedback, and human-interpretable policy evolution. We validate the proposed framework on a real mobile robot performing a multi-stage manipulation and navigation task under execution uncertainty. Experimental results show that the framework supports explainable policy improvements, closed-loop adaptation to execution failures, and reliable deployment on physical robotic systems.
Learned Controllers for Agile Quadrotors in Pursuit-Evasion Games
In this letter we study 1v1 quadrotor pursuit-evasion, where a pursuer and an evader are trained via reinforcement learning (RL) by competing against each other. Such adversarial settings face well-known challenges: each agent's policy changes during training, creating a non-stationary environment; agents might overfit to the current opponent and forget earlier strategies (catastrophic forgetting); and the competitive dynamics can cause strategy cycling or policy collapse. To address these issues, we propose Asynchronous Multi-Stage Population-Based training with Hedge sampling (AMSPBH), a method based on Policy-Space Response Oracles (PSRO) and adapted to quadrotor RL control. PSRO maintains a population of previously trained policies and trains new approximate best responses against mixtures of that population instead of against a single opponent. In AMSPBH, each generation trains one agent with Proximal Policy Optimization (PPO) against frozen opponent policies, while a Hedge sampler assigns higher probability to opponents that are currently difficult to beat. We show that: (i) AMSPBH discovers new strategies while retaining competence against older opponents, reaching a regime where additional best-response training gives limited improvement; (ii) compared to training against only the latest opponent, population-based training generalizes better across diverse and unseen strategies; and (iii) the learned population contains distinct pursuit and evasion behaviors, providing useful strategic diversity for finding weaknesses and improving controller robustness. We validate the trained policies with hardware experiments on Crazyflie brushless quadrotors, showing zero-shot sim-to-real transfer of agile, reactive pursuit-evasion behavior against both handcrafted and learned adversaries, with physical flights reaching up to 4.87 m/s.
comment: Under review
OSCAR: Obstacle Survival Curves for Adaptive Robot Navigation
A mobile robot following a graph of known routes can make costly navigation errors when a temporary obstacle blocks a critical edge: waiting too long behind a parked cart wastes time, but immediately rerouting around a person who would move in a few seconds is also inefficient. Standard reactive obstacle avoidance addresses local motion around obstacles, while fixed wait-or-reroute rules ignore how long different obstacle types tend to persist. We propose OSCAR: an adaptive survival-modeling framework for graph-based navigation with temporary blockages. Assuming obstacle class labels are available at encounter time, the robot learns class-conditioned residual clearance-time distributions from online experience, including right-censored observations when it reroutes before observing clearance. These survival models are integrated into a time-dependent graph planner that maintains obstacle memory and computes a patience threshold at each blocked edge: how long to wait before taking an alternate route. The method continuously updates its clearance estimates across episodes and uses them to balance waiting against rerouting. We evaluate the approach in simulation and on a real mobile robot in a university atrium with obstacles including people, chairs, bins, and tubes. In simulation, the learned policy's time-to-goal converges to within 1% of an oracle with access to ground-truth clearance distributions after fewer than 20 observations per obstacle class, outperforming all heuristic baselines. Real-world deployment confirms that the policy improves online, adapting its patience thresholds from experience across 50 navigation episodes.
comment: 8 pages main text, appendices included
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think
Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy. To exploit this, we introduce a structural compression pipeline that is entirely training-free, bypassing the need of existing methods to load full-scale models to learn optimized token reductions or dynamic layer selectors. Instead, using only a single forward pass via Centered Kernel Alignment to identify redundant layer features, we remove twin layers to permanently compress the model depth by up to 50% across both the VLM backbone and the continuous control policy head. Downstream fine-tuning of this streamlined architecture yields a dual acceleration benefit: a 40-50% reduction in training time and up to 30% faster real-time inference, while matching or exceeding full-scale base model performance. We comprehensively validate our method across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 diverse real-world manipulation tasks across 4 unique robotic embodiments. These results prove that advanced VLAs require significantly fewer layers than previously assumed, offering a highly compute-efficient paradigm for scalable robot learning.
Vec-QMDP: Vectorized POMDP Planning on CPUs for Real-Time Autonomous Driving
Planning under uncertainty for real-world robotics tasks, such as autonomous driving, requires reasoning in enormous high-dimensional belief spaces, rendering the problem computationally intensive. While parallelization offers scalability, existing hybrid CPU-GPU solvers face critical bottlenecks due to host-device synchronization latency and branch divergence on SIMT architectures, limiting their utility for real-time planning and hindering real-robot deployment. We present Vec-QMDP, a CPU-native parallel planner that aligns POMDP search with modern CPUs' SIMD architecture, achieving $227\times$--$1073\times$ speedup over state-of-the-art serial planners. Vec-QMDP adopts a Data-Oriented Design (DOD), refactoring scattered, pointer-based data structures into contiguous, cache-efficient memory layouts. We further introduce a hierarchical parallelism scheme: distributing sub-trees across independent CPU cores and SIMD lanes, enabling fully vectorized tree expansion and collision checking. Efficiency is maximized with the help of UCB load balancing across trees and a vectorized STR-tree for coarse-level collision checking. Evaluated on large-scale autonomous driving benchmarks, Vec-QMDP achieves state-of-the-art planning performance with millisecond-level latency, establishing CPUs as a high-performance computing platform for large-scale planning under uncertainty.
ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training
We study how to improve large foundation vision-language-action (VLA) systems through human-in-the-loop reinforcement learning (RL) in real-world environments. A key challenge is learning reliable value functions from heterogeneous real-world experience, as value estimation provides the primary learning signal for VLA training. In practice, replay buffers contain trajectories collected from historical policies, online rollouts, demonstrations, and intermittent human interventions. Because replay buffers mix trajectories generated by different behaviors, the observed returns can be mismatched with the quality of the current policy. Prior VLA post-training methods often rely on progress-style value signals, which reflect the average quality of historical behaviors, leading to mismatched learning signals for the current policy. In this paper, we propose ALOE, an off-policy evaluation framework whose value function directly evaluates current-policy behavior for each iteration. Specifically, ALOE combines chunked temporal-difference bootstrapping and conservative value aggregation to perform stable current-policy evaluation, then uses these estimates for advantage-weighted policy improvement. This design improves credit assignment to critical action chunks under sparse rewards and supports stable policy improvement. We evaluate ALOE on four real-world manipulation tasks encompassing long-horizon and high-precision scenarios: smartphone packing, laundry folding, multi-object sorting, and phone assembly. Across all tasks, ALOE outperforms other VLA post-training methods, highlighting the benefit of off-policy value estimates for real-world VLA post-training. Videos are available at our project website https://rooshy-yang.github.io/aloe.
EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies
We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with near success rates exhibit strikingly different capability profiles: $π_{0.5}$ achieves the highest test success rate and the best train--test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a disjoint set of atomic skills compared to other policies. Beyond capability profiling, EBench analyzes the generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal strengths and weaknesses of models behind an overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
KiVi: Kinesthetic-Visuospatial Integration for Dynamic and Safe Egocentric Legged Locomotion
Vision-based locomotion has shown great promise in enabling legged robots to perceive and adapt to complex environments. However, visual information is inherently fragile, being vulnerable to occlusions, reflections, and lighting changes, which often cause instability in locomotion. Inspired by animal sensorimotor integration, we propose KiVi, a Kinesthetic-Visuospatial integration framework, where kinesthetics encodes proprioceptive sensing of body motion and visuospatial reasoning captures visual perception of surrounding terrain. Specifically, KiVi separates these pathways, leveraging proprioception as a stable backbone while selectively incorporating vision for terrain awareness and obstacle avoidance. This modality-balanced, yet integrative design, combined with memory-enhanced attention, allows the robot to robustly interpret visual cues while maintaining fallback stability through proprioception. Extensive experiments show that our method enables quadruped robots to stably traverse diverse terrains and operate reliably in unstructured outdoor environments, remaining robust to out-of-distribution(OOD) visual noise and occlusion unseen during training, thereby highlighting its effectiveness and applicability to real-world legged locomotion. Project Page: https://marmotlab.github.io/kivi-quadruped/
comment: \c{opyright} 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
NeuPAN: Direct Point Robot Navigation with End-to-End Model-based Learning
Navigating a nonholonomic robot in a cluttered, unknown environment requires accurate perception and precise motion control for real-time collision avoidance. This paper presents NeuPAN: a real-time, highly accurate, map-free, easy-to-deploy, and environment-invariant robot motion planner. Leveraging a tightly coupled perception-to-control framework, NeuPAN has two key innovations compared to existing approaches: 1) it directly maps raw point cloud data to a latent distance feature space for collision-free motion generation, avoiding error propagation from the perception to control pipeline; 2) it is interpretable from an end-to-end model-based learning perspective. The crux of NeuPAN is solving an end-to-end mathematical model with numerous point-level constraints using a plug-and-play (PnP) proximal alternating-minimization network (PAN), incorporating neurons in the loop. This allows NeuPAN to generate real-time, physically interpretable motions. It seamlessly integrates data and knowledge engines, and its network parameters can be fine-tuned via backpropagation. We evaluate NeuPAN on a ground mobile robot, a wheel-legged robot, and an autonomous vehicle, in extensive simulated and real-world environments. Results demonstrate that NeuPAN outperforms existing baselines in terms of accuracy, efficiency, robustness, and generalization capabilities across various environments, including the cluttered sandbox, office, corridor, and parking lot. We show that NeuPAN works well in unknown and unstructured environments with arbitrarily shaped objects, transforming impassable paths into passable ones.
comment: Accepted by TRO 2025; project website: https://hanruihua.github.io/neupan_project/
Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion
Although reinforcement learning has significantly advanced humanoid locomotion, perceptive policies still struggle on sparse-foothold terrain and constrained environments. Success in these scenarios requires both broad terrain awareness and precise foothold selection, two perceptual roles that conventional encoders often entangle. To address this challenge, we propose Global-Local Attention Decomposition (GLAD) for terrain encoding in humanoid locomotion. Realized by a coarse-to-fine encoder over a robot-centric elevation map, GLAD explicitly separates these objectives: a global attention branch utilizes attention pooling to summarize the surrounding terrain context, while a state-conditioned local attention branch sparsifies and encodes precise foothold-relevant geometry. This explicit attention decomposition prevents the dilution of fine-grained spatial cues while reducing training overhead. Experiments demonstrate that GLAD enables reliable locomotion over challenging gaps, stepping stones, and stairs. Furthermore, the learned policy exhibits emergent terrain-responsive behaviors, autonomously following narrow paths and avoiding obstacles under simple velocity commands without explicit navigation planners. In real-world deployment on a Unitree G1 humanoid robot using onboard LiDAR, the proposed method achieves robust zero-shot sim-to-real transfer across diverse sparse-foothold and obstacle-rich domains.
Unified Motion-Action Modeling for Heterogeneous Robot Learning
We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.
comment: https://uma-manipulation.github.io/
Multiagent Systems
Revelio: Cost-Efficient Agentic Memory Safety Vulnerability Detection For Repository-Scale Codebases
Memory safety vulnerabilities remain a significant threat even for projects with extensive fuzzing and manual auditing. Recent results suggest that large language models hold great promise for detecting such vulnerabilities, but they are unreliable, at risk of hallucination, and challenging to scale to repository-size codebases. This paper presents Revelio, a cost-efficient end-to-end agentic framework for memory-safety vulnerability discovery. Revelio addresses the problem of hallucination by generating an executable Proof-of-Vulnerability, which is checked with a deterministic sanitizer. It reduces cost using inexpensive LLMs and lightweight static analysis to help generate and rank vulnerability hypotheses, reporting vulnerabilities only when they can be reproduced and confirmed by a sanitizer. We evaluated Revelio on seven production-quality projects that had been continuously fuzzed for five to eight years, as well as on 100 randomly selected Arvo projects from the CyberGym benchmark. With around one hour per project and a total cost of $300, Revelio discovered 19 previously unknown memory-safety vulnerabilities. On benchmarks, Revelio outperformed frontier coding agents across diverse backbone models at comparable token costs. Our results suggest that Revelio enables scalable and trustworthy end-to-end LLM-based memory-safety vulnerability detection.
When Is Emergent Consensus Real? A Measured Coupling Gain and a Validity Diagnostic for LLM Agent Societies
LLM "agent societies" are studied via demonstrations of emergent consensus or polarization -- with no measurable control parameter, no theory of when each regime appears, and no test of whether an outcome is a genuine social dynamic or a model artifact. We introduce the coupling gain gamma, measured per-agent by counterfactually perturbing a neighbour's stated opinion. (i) gamma is stable and model-distinguishing -- across five frontier models it spans 0.15-0.43 (n=20, 95% CIs <= 0.025), paraphrase-invariant; social-neighbour gamma roughly equals numeric-anchor gamma, so gamma is evidence-coupling, not uniquely social. (ii) Classical dynamics with measured (not assumed) coefficients organise the regime: Friedkin-Johnsen for consensus/pluralism, signed-Laplacian/structural-balance for polarization. (iii) Frontier LLMs do not spontaneously backfire (beta <= 0), so default societies do not self-polarize -- polarization is always induced; the beta>0 branch arises only in the FJ surrogate, never in the agents. (iv) A randomized-initial-condition diagnostic -- the (slope, bias) of final vs. initial opinion -- separates genuine averaging from model-prior artifacts (boundary-censoring ruled out by construction via interior-valued facts); applied to a published "emergent consensus" result (Chuang et al. 2023) it reveals a model-specific conflation: averaging on debatable claims, prior-artifact on settled facts. (v) Coupling is context-dependent: pairwise gamma does not predict multi-neighbour outcomes -- it can order them backwards -- whereas a modality-matched group coupling does (sixteen closed+open models, Pearson r=-0.70, permutation p=0.008). The regime laws take this matched coupling, not the single-neighbour gamma: emergent consensus must be read from coupling in the target interaction. We contribute a measurement protocol and a validity instrument, not new theory.
comment: 13 pages (incl. appendix with proofs), 7 figures. Code and per-run logs released
Social Reality Construction via Active Inference: Modeling the Dialectic of Conformity and Creativity
Social agents both internalize collective norms and reshape them through creative action, yet computational models have not captured this bidirectional process within a unified framework. We propose a multi-agent simulation model grounded in active inference that formalizes the dialectical constitution of social reality on a structured social network. Each agent maintains an internal generative model, communicates with neighbors to form social priors, creates novel observations, and selectively incorporates others' creations into memory. Simulation experiments demonstrate three main findings. First, informationally cohesive social groups emerge endogenously, with representational alignment mirroring the cluster topology of the underlying network. Second, a circular mutual constitution arises between social representations and the observation distribution, maintained through agents' creative acts that project representational structure onto the external world. Third, the propagation of creations exhibits selective, heterogeneous patterns distinct from the stable diffusion of social representations, indicating that agents construct cultural niches through local interaction dynamics. These results suggest that the interplay between social conformity and creative deviation can give rise to the endogenous formation and differentiation of shared social reality.
comment: Accepted at ALIFE 2026 conference
Scaling Small Agents Through Strategy Auctions ICML 2026
Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents' performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 52%, lowers overall cost by 35%, and consistently improves upon the largest agent's pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost, often both, underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively "scaled up" through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.
comment: ICML 2026
Hierarchical Adversarial Bandits for Online Configuration Optimization KDD 2026
Motivated by Online Configuration Optimization in large, dynamic parameter spaces, this work studies the nonstochastic multi-armed bandit (MAB) problem in metric action spaces with oblivious Lipschitz adversaries. We propose ABoB (Adversarial Bandit over Bandits), a hierarchical framework that decomposes the configuration space into clusters to accelerate learning and adapt to changing environments. We evaluate ABoB using standard algorithms such as EXP3 and Tsallis-INF on a real-world production storage system, demonstrating significant performance gains of up to $50\%$ compared to state-of-the-art "flat" bandit algorithms. Extensive simulations further confirm that ABoB effectively exploits metric structures, achieving up to $91\%$ improvement in adversarial metric scenarios while significantly reducing computational running time. Theoretical analysis grounds this empirical success: we prove that ABoB maintains a worst-case "safety net" bound of $O(\sqrt{kT})$, matching traditional methods, where $T$ is the number of rounds and $k$ is the number of arms, while capable of accelerating learning to $O(k^{1/4}\sqrt{T})$ under favorable Lipschitz conditions. This combination of operational efficiency and theoretical soundness makes ABoB a practical solution for automated system tuning.
comment: Paper was accepted to ACM SIGKDD 2026
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis KDD 2026
Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present a empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines, and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
comment: KDD 2026
HEAS: Hierarchical Evolutionary Agent-Based Simulation Framework for Multi-Objective Policy Search
HEAS is a Python framework that connects agent-based simulation, evolutionary search, and scenario-based evaluation in a single reproducible pipeline. It is designed for researchers who study systems where local interactions produce system-level outcomes-ecosystems, organizations, markets, or regulatory environments-and who need to search over candidate strategies and compare them across uncertain scenarios. HEAS combines three modules: a hierarchy runtime for composing simulations from reusable process layers, an evolutionary tuner for single- or multi-objective search backed by DEAP, and a game module for evaluating strategies across scenario ensembles. Its central design principle is the "metric contract": the same outcome function is shared by optimization, evaluation, and validation, so that different parts of an analysis cannot silently rank strategies by different quantities.
comment: 6 pages, 2 figures. Python package: https://pypi.org/project/heas/ | Web playground: https://ryzhanghason.github.io/heas/
When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks, Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding-based defenses: because they rely solely on the text embeddings, they ignore token-level confidence signals such as logits, which can remain informative when embeddings are not distinguishable under attack. We propose using confidence scores to prune or down-weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. This insights can inform and inspire future work on MAS attacks and defenses.
Systems and Control (EESS)
Regret-Guaranteed Safe Switching: LQR Setting with Unknown Dynamics
We consider learning-based control in LQR setting, where the parameters associated with each mode are a priori unknown. The next mode to be activated is revealed online only at the time of switching. The objective is to determine both the switching times and the control gains for each mode such that (1) the norm of the system state remains bounded according to a prescribed criterion, and (2) the accumulated cost is minimized. To formalize the state-norm requirement, we introduce the notion of $(α,β)$-controllability for given parameters $α$ and $β$. We first study the problem in a known model setting and show that, under the switching mechanism described above and under the assumption that each mode is visited infinitely often, the strategy that minimizes the average expected cost consists of applying, in each mode, the feedback gain obtained from the solution of the discrete algebraic Riccati equation, while selecting dwell times that sufficiently satisfy the controllability condition. We refer to this strategy as the benchmark policy. Next, we propose an algorithm for the unknown-model setting that minimizes the regret, defined as the difference between the cumulative cost incurred by the online algorithm and that of the offline benchmark. By accurately estimating dwell-time errors, our method achieves an expected regret of $\mathcal{O}(|\mathcal{M}|^{1/4} n_s^{3/4} + n_m)$, where $n_s$ denotes the number of switches, $|\mathcal{M}|$ is the number of modes, and $n_m$ is the number of malignant switches.
Monotonic, Minimum-Settling-Time PI Tuning for First-Order-Plus-Dead-Time Plants: A Tangency Characterization
This paper studies PI tuning of a first-order-plus-dead-time (FOTD) plant for the fastest strictly monotone (zero-overshoot) setpoint step response, with monotonicity imposed on the plant output only. The minimizer is shown to be neither the pole-zero cancellation design nor the multiple real dominant pole (MRDP) design. It is a non-cancellation point at which the closed loop carries a slow real mode of small residue together with a faster underdamped complex pair, with the controller zero placed near the dominant real pole. The analytical centerpiece is a single tangency identity, tan(omega tau_star + alpha) = (a - b)/omega, which states that the monotonicity boundary is the locus where the secondary complex mode just fails to drive the output slope below zero. From this identity the design reduces to nested scalar conditions, realized at three levels of fidelity: an explicit closed-form rule, an exact response-based reduction, and a simulation-free transcendental system whose only non-elementary step is a fourth-order polynomial root. Relative to the critically damped Lambert-W cancellation rule the design reduces the 2% settling time by 14 to 52 percent and lowers the load integrated absolute error by 5 to 38 percent. We report the full cost: for delay-dominated plants (T/L <= 0.55) the control stays one-pulse, so the design is itself admissible in Huba's sense and merely faster, but for larger lag ratios the control becomes two-pulse, and across the range the maximum sensitivity rises from 1.39 to between 1.44 and 1.62. The contribution is positioned not as a uniformly better tuning but as the exact characterization of a specific, well-defined operating point, with an honest multi-metric comparison against established rules.
Information Design under Uncertain Utilities: Probabilistic and CVaR Approaches
This paper studies information design when the designer lacks precise knowledge of agents' payoff coefficients. The Calibrated Bayes Correlated Equilibrium (Cal-BCE) is introduced as a solution concept that augments the Bayes correlated equilibrium with a corrector policy preserving incentive compatibility under the designer's structural uncertainty, adapting its revelation principle to this setting. The design problem is nonconvex in general, but under a linear-quadratic-Gaussian structure it admits convex second-order cone and semidefinite reformulations under two-sided probabilistic and conditional value-at-risk (CVaR) constraints, with feasibility guaranteed by a Hadamard invertibility condition. A joint decentralization theorem shows that both designs cap cross-agent action covariances, the CVaR design more tightly at a common tolerance; but because the formulations operate at design-specific feasibility thresholds, the realized ordering is calibration-dependent. Experiments on fifteen sector ETFs confirm the trade-off: the probabilistic design attains higher mean welfare and the CVaR design better tail protection, with neither dominating outright.
comment: 37 pages, 9 figures
Zero-shot Transfer of Reinforcement Learning Control Policies for the Swing-Up and Stabilization of a Cart-Pole System
Reinforcement learning (RL) is a powerful and convenient tool to modernize controller design. In this work, we study the zero-shot transfer of RL-based control policies from simulation to hardware for cart-pole swing-up and stabilization. The two policies are trained independently, and the handoff is implemented in Simulink via switching logic. We apply a first-order action smoothing filter to prevent hardware damage from high-frequency oscillatory actuation. Pairing this bandwidth-aware filtering with sensitivity-guided domain randomization (DR) and a simple linear curriculum learning (CL) schedule, we obtain a swing-up policy that in all of our experiments injects sufficient energy for handoff into the stabilizer's region of attraction. The stabilization policy rejects disturbances within the tested range, and the swing-up policy can re-engage after larger perturbations and restores the pendulum to the inverted position.
Durability-Aware Multi-Objective Optimization of the Jansen Linkage: Trading Gait Quality Against Joint Wear
The Jansen linkage is a single-degree-of-freedom planar leg mechanism whose eleven "holy numbers" were evolved by Theo Jansen to optimize the foot-path gait alone, with no regard for the wear of its revolute joints. This paper introduces a durability objective into the design of the Jansen leg. A parametric forward-kinematic model (two-circle-intersection solver), an inverse-dynamic model (constraint-Jacobian / Lagrange-multiplier formulation of a seven-body, ten-joint system, independently cross-verified by a reduced-DOF energy method), and an Archard wear model are coupled to evaluate, for any set of link lengths, both gait quality and the per-cycle sliding wear at every pin. Because the wear is computed on ideal, clearance-free revolute joints, the resulting wear figures are a relative comparative ranking rather than an absolute life prediction. A bi-objective problem -- composite gait error versus total joint wear, subject to step-length, ground-clearance, duty-factor and assembly constraints -- is solved with NSGA-II. Under the adopted gait metric the classical Jansen design is Pareto-dominated: for a representative design, link-length adjustments within +/-29% simultaneously flatten the stance (-28%), smooth the stance velocity (-58%) and reduce total joint wear by ~56%. A sensitivity study shows the wear advantage is robust across a crank-speed x payload envelope (48%-56%) and identifies the link lengths that most strongly govern wear. A variance-based global (Sobol) analysis confirms that two link lengths dominate the wear variance, and a Monte-Carlo manufacturing-tolerance study shows the wear advantage degrades gracefully under realistic fabrication error. The framework provides a practical route to longer-lived walking linkages and a baseline for future wear-clearance-impact coupled studies.
comment: 15 pages, 10 figures, 7 tables
On the connection between input-output resonances and internal modes of linear time-invariant systems
It is shown that in general, there is no connection between the location of the internal modes of a Linear Time-Invariant (LTI) system and the shape of its input-output frequency response. In particular, it is shown that resonance peaks of the frequency response do not necessarily correspond to under-damped internal modes. This phenomenon, though rare, can occur in high (or infinite) dimensional LTI systems. In the Single Input Single Output (SISO) case, this phenomenon can be attributed to the location of system zeros, while in certain Multi Input Multi Output (MIMO) cases without system zeros, it can be attributed to the non-normality of the matrix generating the internal dynamics.
Electromagnetic Characterization of Magnetic Bar: Case of Square Cross-Section Shape
This paper presents a complete two-dimensional theoretical model for the electromagnetic behavior of square-section solid magnetic bars under sinusoidal loading. Through the application of Maxwell's equations within a Cartesian coordinate system and the integration of complex permeability, exact mathematical expressions are derived for mutual impedance, internal magnetic fields, flux, and core losses. Hyperbolic functions are utilized to separate the variables, enabling the accurate representation of edge flux accumulation and the 2D skin effect. In addition to mathematically decoupling eddy current and hysteresis losses, this formulation yields a new apparent permeability parameter. This parameter establishes a fast, reliable method for magnetic steel characterization that bypasses the extensive processing times associated with Finite Element Analysis (FEA). Numerical results over 1 Hz-1 MHz show the apparent relative permeability decreasing from 500 to 300 and a characteristic resistance peak near 700 kHz, marking the transition from volumetric to surface-dominated loss regimes.
Reinforcement Learning-Based Traffic Signal Control for IoT-Enabled Intersections
Urban traffic congestion remains a persistent challenge in car-dependent cities, imposing significant economic and societal costs. Traffic signal systems are increasingly deployed as networked cyber-physical components within smart-city infrastructures, where distributed sensing and edge intelligence enable adaptive traffic management. This paper investigates reinforcement learning (RL) as an edge-intelligent approach for adaptive traffic signal operation at a signalized urban intersection in Kuwait. A Proximal Policy Optimization (PPO)-based controller is developed to dynamically allocate green-phase durations using locally observed traffic states, without relying on future demand information or centralized coordination. The controller is evaluated in a realistic simulation environment informed by real-world hourly traffic volume data from Kuwait, and is compared against both conventional fixed-time control and a vehicle-actuated controller representing the current state of practice, using average vehicle delay, queue length, and emissions as performance metrics. Under nominal conditions, the proposed controller reduces average vehicle delay by 46% relative to fixed-time control and 34% relative to actuated control, while also lowering per-vehicle CO2 emissions by approximately 23%. These performance gains persist under demand perturbations of +/-15%, generalize from weekday to weekend traffic patterns, and are corroborated by a reward function ablation; low variance across five random seeds confirms their statistical reliability. These findings demonstrate the practicality of learning-based edge traffic signal control as a building block for IoT-enabled smart-city transportation systems, and as a deployable precursor toward fully connected, Internet of Vehicles (IoV)-based urban mobility.
comment: 15 pages, 7 figures, submitted to IEEE Open Journal of Intelligent Transportation Systems
A Pre-Dispatch Resonance Safety Criterion for AI Training Clusters
Hyperscale AI training clusters operate under the Bulk Synchronous Parallel protocol, which impose a periodic power swing on the transmission grid. Every GPU in the job transitions between compute and idle in lockstep, so the aggregate power traces a square wave at the training iteration period. Production iteration periods of one to ten seconds place the forcing frequency within the inter-area electromechanical mode band of large interconnections, where a training schedule can drive a mode at resonance. This paper derives a closed-form pre-dispatch safety criterion that bounds the maximum cluster size a grid can absorb at any proposed iteration period. The derivation inverts the steady-state forced two-area swing equations. The criterion defines a danger band of iteration periods, extends to the square-wave harmonics, and parameterizes the modal response from planning-study eigenanalysis and the forcing amplitude from GPU specifications. Applied to the IEEE 39-bus system at a production-representative duty cycle, the criterion shows that the maximum safe cluster at resonance is $66\,900$ GPUs under light damping. Rescheduling the same job less than one second away from resonance reduces the deviation $7.4\times$ with no hardware change. These results establish the training iteration period as a controllable grid-safety parameter and supply the analytic screening tool that reliability directives on current large loads lack.
comment: Submitted for North American Power Symposium (NAPS) 2026, October 11 - 13, 2026
TSO-DSO Coordination for Flexibility Management Across Voltage Levels SC
Several sources of flexibility in transmission and, especially, distribution networks are being unlocked by advances in information and communication technologies, aggregators, and new flexibility markets. However, maximizing benefits for both transmission and distribution system operators in a coordinated way requires new algorithms, modeling tools, and modernization of regulatory frameworks. Such approaches must account for uncertainties, the physical and operational constraints of flexibility providers and the grid itself, constraints on information exchange, and scalability, including computational requirements and time constraints. Given the diverse contexts and jurisdictions around the world, there is no single recipe for achieving coordination, but important trends and shared challenges are emerging. This paper surveys the complexities of coordination from technical, market, and technological perspectives, and outlines current practices, proposed approaches, and future research directions to effectively manage, coordinate, model, and leverage flexibility across voltage levels.
comment: 27 pages, 5 figures, invited survey paper to the 24th Power Systems Computation Conference (PSCC 2026)
Uncertainty-Disentangled Probabilistic Stability Analysis in Wind Power Integrated Weak Grids
Conventional probabilistic small-signal stability analysis (PSSSA) propagates a single forecast distribution, conflating irreducible weather randomness (aleatoric) with reducible forecast-model uncertainty (epistemic). This letter propagates a second-order renewable forecast through the modal-stability map via an independent \emph{germ} variable, separating the two contributions exactly in closed form by a disentangled polynomial chaos expansion (d-PCE). The split underpins a forecast-aware $(α,β,γ)$ stability certificate whose conservative branch converges to its irreducible aleatoric limit at $O(N^{-1/2})$ -- making a failed certificate diagnostic: epistemic-dominated risk recovers with better data; aleatoric-dominated risk needs improvements of the physical control system.
Joint Visibility Analysis of RIS in Non-Terrestrial Networks through Stochastic Geometry
Non-Terrestrial networks (NTNs) are a key theme in upcoming 6G communications, especially for ubiquitous coverage. Urban environments, comprising of high rise buildings often result in blocking the line of sight (LoS) path between the user equipment (UE) and the NTN base station (NTN-BS). In this paper we investigate the situation where reconfigurable intelligent surfaces (RIS) are deployed on the building roof-tops to ensure multi-hop connectivity between the UE and the NTN-BS. In such a scenario, it becomes crucial to statistically study the LoS visibility of the RIS from the UE as well as from the NTN-BS, hence termed as joint visibility. In this work, accounting for the dual stochasticity arising from the locations of the RIS deployed buildings and the respective random building heights, we statistically study the probability of joint RIS visibility in a two-dimensional (2D) scenario considering a deterministic location of the NTN-BS. Further, we study the joint RIS visibility statistics conditional on the UE-NTN link being LoS or non-LoS. For the RISs deployed as a point point process (PPP) having exponentially distributed heights, the expected RISs jointly visible under the unconditional and conditional geometric settings are derived in closed form. Interestingly, in the 2D setting, the maximum expected RISs jointly visible, unconditionally, is twice the Basel number $(π^2/ 6)$. The simulated results are analyzed over building density, average building height, the altitude and position of the NTN-BS. We also illustrate probability heatmaps, demonstrating the strongest chance to have a RIS used conditioned on the system geometry. This study is expected to be useful in planning the deployment of RIS in urban areas, improving the signal and for assessing economic aspects.
comment: Manuscript consists of 13 pages, 21 figures (including subfigures)
DC Link Capacitor Ripple Constraints Limit the Benefits of Utility-Owned Four-Wire Power Converters
Utilities are increasingly interested in power converters to increase the headroom of their assets by actively controlling power flows on their network. In this work we demonstrate that thermal limits of dc link capacitors can result in substantially diminished benefits of these converters under unbalanced operation, due to constraints on neutral current and double-line frequency power ripple. Considering nine voltage source converter topologies with varying ripple capabilities, the upper bound (in terms of additional headroom released) increases by more than 80% compared to a no-ripple case for the application of phase current unbalance mitigation.
comment: Accepted for presentation at PowerUp 2026 conference (Boulder, Colorado, USA)
Mind the Intention: Task-Aware Backdoor Attacks for Forecast-Driven Distribution Network Operations
Accurate distributed energy resources (DERs) forecasting is critical for downstream optimal operations. However, such forecast-based operation can be highly vulnerable to cyberattacks. While existing research mainly focuses on adversarial attacks, we pivot to a more controllable and persistent threat: backdoor attacks. In time series forecasting, a backdoored model generates an attacker-specified target pattern whenever a trigger is embedded in historical inputs. This paradigm naturally fits the entire DER forecast-optimization-operation chain. In this paper, we investigate whether and how backdoor attacks can compromise distribution network operations and propose GridTroj, a unified backdoor framework tailored for this scenario. Unlike standard time series backdoor approaches that train a poisoned model to match a predefined target only in terms of forecasting error, GridTroj explicitly incorporates the attacker's intention and optimizes the attack toward operational disruption. Specifically, GridTroj coordinates two key modules. The Intention Planner designs operation-damaging targets and poisoning strategies, while the Backdoor Realizer constructs the corresponding network architecture and training strategy to learn the trigger-target association. Experiments on three downstream optimization tasks demonstrate that GridTroj can effectively compromise grid operations and outperforms existing baselines. Our code is available at https://github.com/YuxuanCEE/GridTroj.
Inference as Flexibility: Ramp Management for Transmission-Connected AI Data Centres
The rapid growth of large AI data centres introduces new operational challenges for power systems, including rapid ramping, oscillatory load behavior, voltage fluctuations, and supply-demand balancing impacts. For example, the Alberta Electric System Operator (AESO) has identified transmission-connected data centres (TCDCs) as large non-conforming loads that may need to limit their point-of-connection ramp rates. Existing mitigation approaches mainly rely on exogenous electrical resources, such as battery energy storage systems (BESS). This paper presents a proof-of-concept demonstration of a complementary software-defined mitigation layer: using flexible large language model (LLM) inference serving as endogenous TCDC flexibility to partially offset AI training power ramps. We consider a 150 MW TCDC with training, inference, and base-load components. A measured LLaMA-2-70B fine-tuning power profile is scaled to represent an aggregate training block, while measured LLaMA-3.1-70B inference power traces are used to model batch-size-dependent inference flexibility. Three strategies are compared: BESS-only mitigation, batch-size-only control, and coordinated batch-size plus BESS control. Simulation results show that the hybrid strategy reduces BESS discharge energy by 71% and peak discharge power by 51%, while maintaining near-complete compliance with a 10 MW/min ramp limit.
Synthesized-Isotropic Narrowband Channel Parameter Extraction from Angle-Resolved Wideband Channel Measurements
Angle-resolved channel sounding using antenna arrays or mechanically steered high-gain antennas is widely employed at millimeter-wave and terahertz bands. To extract antenna-independent large-scale channel parameters such as path loss, delay spread, and angular spread, the radiation-pattern effects embedded in the measured responses must be properly compensated. This paper revisits the technical challenges of path loss/path gain calculation from angle-resolved wideband measurements, with emphasis on angular-domain power integration where the scan beams are inherently non-orthogonal and simple power summation leads to biased isotropic-equivalent power estimates. We first formulate the synthesized-isotropic narrowband power in a unified matrix form and introduce a beam-accumulation correction factor, including an offset-averaged variant to mitigate scalloping due to off-grid angles. The proposed framework is validated through simulations using channel models and 154~GHz corridor measurements.
Control Synthesis with Reinforcement Learning: A Modeling Perspective
Controllers designed with reinforcement learning can be sensitive to model mismatch. We demonstrate that designing such controllers in a virtual simulation environment with an inaccurate model is not suitable for deployment in a physical setup. Controllers designed using an accurate model is robust against disturbance and small mismatch between the physical setup and the mathematical model derived from first principles; while a poor model results in a controller that performs well in simulation but fails in physical experiments. Sensitivity analysis is used to justify these discrepancies and an empirical region of attraction estimation help us visualize their robustness.
Artificial Intelligence for Power-Converter-Rich Electrical Systems: A Review
Power-converter-rich electrical systems, formed by renewable generation, electrified transportation, and inverter-based resources, exhibit strongly nonlinear dynamics, multi-physics design tradeoffs, fast control requirements, and growing reliability and cybersecurity constraints. These characteristics strain workflows that rely only on physics-based modeling, sequential optimization, and rule-based operation. This paper reviews artificial intelligence (AI) for power-converter-rich electrical systems through a life-cycle and deployment-readiness perspective. The literature is organized across converter design, real-time control, system-level operation, and compliance-oriented governance. For design, we examine surrogate modeling, topology and parameter synthesis, EMI/EMC-aware optimization, reliability-oriented design, and knowledge-assisted workflows. For control, we compare supervised learning, reinforcement learning, learning-augmented predictive control, and safety-constrained learning according to their role in closed-loop implementation. For operations, we focus on microgrid coordination, forecasting, distribution-system observability, privacy-preserving coordination, and cyber-resilient operation where converter-interfaced resources shape the operating problem. Across these stages, the review emphasizes deployment-critical gaps, including stability certification, constraint satisfaction, interpretability, extrapolation, data efficiency, sim-to-real transfer, embedded latency, cybersecurity, privacy, and standards alignment. The resulting taxonomy is intended to clarify where AI is already useful as an engineering support tool and where further validation is needed before autonomous or safety-critical deployment.
Linear Power System Modeling and Analysis Across Wide Operating Ranges: A Hierarchical Neural State-Space Equation Approach
As modern power systems exhibit increasingly high-dimensional, nonlinear, and uncertain characteristics, the applicability of classical linear state-space methods is severely challenged. Existing paradigms struggle to reconcile the analytical transparency of physics-based models with the continuous nonlinear generalization of AI. To address this, the Hierarchical Neural State-Space Equation (HNSSE) framework is proposed. At the component level, the formulated Neural State-Space Equation (NSSE) extends Neural ODEs to learn continuous dynamic manifolds across varying conditions while strictly preserving local analytical transparency. At the system level, a hierarchical architecture analytically fuses components via network constraints, constructing an interaction-consistent global Neural ODE while circumventing the curse of dimensionality. To ensure robust convergence under noisy measurements, a training strategy synergizing spatiotemporal slicing, physics-informed curriculum learning, and Expectation-Maximization-based refinement is established. Validation on the large-scale Guangdong Power Grid demonstrates the framework's remarkable performance in interpretable state-space reconstruction, high-fidelity trajectory prediction, continuous stability perception, and noise robustness. Comprehensive comparisons substantiate HNSSE's superiority as a unified, interpretable paradigm for complex power system modeling.
comment: 24 pages, 12 figures, 5 tables
FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs
Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. Mixture-of-Experts (MoEs) architectures partially decouple model capacity from per-token compute. This efficiency alone does not make MoE training feasible over ordinary Internet links or loosely connected commodity hardware since active expert routing still assumes high-speed datacenter fabrics. Low-communication methods such as DiLoCo and Photon reduce synchronization frequency across distributed sites, mitigating bandwidth constraints, yet still require full model replicas at every site. This creates a mismatch: modern MoEs have sparse data paths, but their distributed training infrastructure remains communication-dense and memory-inefficient, limiting attempts to pool geographically distributed compute. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers and skipping non-resident experts during local training. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over Distributed Data Parallelism (DDP) via partial expert replication in controlled regimes; (II) achieves empirical throughput speedups of up to 1.4x through the skip-token mechanism; and (III) shows stable routing in the trained regimes and projects the communication/memory benefits to 100B-scale configurations through system modeling.
Social learning community detection with nonlinear interaction
Conventional community detection requires centralized network data, making it unsuitable for distributed or privacy-preserving systems. In this paper, we demonstrate that macroscopic graph partitioning can emerge purely from strictly local, privacy preserving interactions driven by social learning. By reframing clustering as a symmetry-breaking process within nonlinear opinion dynamics, we show that exchanging saturated state dependent signal (like public actions) forces a network to naturally fracture along its sparsest cuts. We mathematically establish the spectral conditions under which dense core communities lock into stable, polarized states, robustly resisting external influence. To apply this mechanism, we propose three decentralized algorithms, leading up to the Score-based Edge Reliability (SER) framework. By evaluating network ties across multiple independent discussion topics, SER statistically bypasses the errors of traditional greedy bisections and naturally isolates structurally ambiguous frontier nodes. Validations on the ABCD benchmark and the real-world Ngogo chimpanzee network confirm that our fully decentralized approach matches the accuracy of globally optimized heuristics (e.g., Louvain, Leiden) up to a theoretical limit of detectable graphs.
Circuit realization and hardware linearization of monotone operator equilibrium networks
It is shown that the port behavior of a resistor-diode network corresponds to the solution of a ReLU monotone operator equilibrium network (a neural network in the limit of infinite depth), giving a parsimonious construction of a neural network in analog hardware. We furthermore show that the gradient of such a circuit can be computed directly in hardware, using a procedure we call hardware linearization. This allows the network to be trained in hardware, which we demonstrate with a device-level circuit simulation. We extend the results to cascades of resistor-diode networks, which can be used to implement feedforward and other asymmetric networks. We finally show that different nonlinear elements give rise to different activation functions, and introduce the novel diode ReLU which is induced by a non-ideal diode model.
Random Walk Learning and the Pac-Man Attack
Random walk (RW)-based algorithms have long been popular in distributed systems due to low overheads and scalability, with recent growing applications in decentralized learning. However, their reliance on local interactions makes them inherently vulnerable to malicious behavior. In this work, we investigate an adversarial threat that we term the ``Pac-Man'' attack, in which a malicious node probabilistically terminates any RW that visits it. This stealthy behavior gradually eliminates active RWs from the network, effectively halting the learning process without triggering failure alarms. To counter this threat, we propose the Average Crossing (AC) algorithm--a fully decentralized mechanism for duplicating RWs to prevent RW extinction in the presence of Pac-Man. Our theoretical analysis establishes that (i) the RW population remains almost surely bounded under AC and (ii) RW-based stochastic gradient descent remains convergent under AC, even in the presence of Pac-Man, with a quantifiable deviation from the true optimum. Our extensive empirical results on both synthetic and real-world datasets corroborate our theoretical findings. Furthermore, they uncover a phase transition in the extinction probability as a function of the duplication threshold. We offer theoretical insights by analyzing a simplified variant of the AC, which sheds light on the observed phase transition.
comment: The updated manuscript represents an incomplete version of the work. A substantially updated version will be prepared before further dissemination
Rapid Quantification of Outdoor Object Visibility in Urban Setting Using Connected-Vehicle Fields of View
Identifying locations that offer maximum visual exposure to passing vehicular traffic is a core problem in urban analytics, with applications spanning urban design, navigation, location-based services, and the placement of street-level assets. Traditional site selection methods often rely on static traffic counts or subjective assessments. This research introduces a data-driven methodology to objectively quantify location visibility by analyzing large-scale connected vehicle trajectory data within urban environments. We model the dynamic driver field-of-view using a forward-projected visibility area for each vehicle position derived from interpolated trajectories. By integrating this with building vertex locations extracted from OpenStreetMap, we quantify the cumulative visual exposure, or ``visibility count'', for thousands of potential points of interest along roadways. The core technical contribution involves the construction of a BallTree spatial index over building vertices. This enables highly efficient (O(logN) complexity) radius queries to determine which vertices fall within the viewing circles of millions of trajectory points across numerous trips, significantly outperforming brute-force geometric checks. Analysis reveals two key findings: 1) Visibility is highly concentrated, identifying distinct 'visual hotspots' receiving disproportionately high exposure compared to average locations. 2) The aggregated visibility counts across vertices conform to a Log-Normal distribution.
The Potential Welfare Gains from Curtailment Trading Under Non-Firm Interconnection
Rapid growth of large loads, especially data centers, is straining grid capacity and increasing interest in non-firm interconnection agreements that exchange faster grid access for curtailment exposure. This shift creates opportunities for differentiated reliability, where curtailment is allocated according to the value consumers place on uninterrupted service. This value is often expressed through the value of lost load (VOLL), an estimate of the cost a consumer bears for unserved energy. Because VOLL differs by more than a hundredfold across customer classes, pro-rata allocation, which cuts every load by the same proportion, ignores variation that could be leveraged to improve grid utilization. This paper introduces the network-constrained Curtailment Credit Market (CCM), a mechanism that lets one curtailable load pay another to take on part of its curtailment obligation. In this market, a high-VOLL load can reduce its own interruption by paying a lower-VOLL load to absorb additional curtailment. Crucially, the CCM clears while enforcing transmission limits. We prove that the CCM can implement every curtailment pattern available to an idealized planner that knows each load's VOLL and assigns curtailment directly. If agents report true lost-load values, CCM clearing attains the planner's total value of served load, the highest value achievable under network constraints. We evaluate the CCM on three test networks: a 3-bus network, the IEEE 24-bus network, and a reduced New York grid spanning multiple load zones. Across these networks, the CCM raises the total value of served load by 1.41 to 1.83 times relative to pro-rata curtailment.
TIP-Search: Time-Predictable Inference Scheduling for Market Prediction under Uncertain Load
Real-time market prediction services need correct predictions before a decision deadline; a correct prediction delivered late is not a usable service output. TIP-Search studies time-predictable inference scheduling over fixed market predictors under uncertain load. It filters conformal latency-quantile feasible models, dispatches over finite workers, and uses shielded constrained online experts to trade accuracy, queue pressure, and deadline risk. The official systems-replay controller is OCO-ACPO, a projected-dual shielded expert selector; SA-OCO-ACPO is a nonstationary stress extension that records interval stress, regret proxies, and constraint-violation proxies while preserving the CPO safety shield. On the optimized deployable pool, TIP-Search reaches 0.994 raw accuracy and 0.991 timely accuracy. On official TLOB FI-2010 h=10, TIP-Search++ raises timely accuracy from 0.156 to 0.239 and deadline satisfaction from 0.391 to 0.962. In the matched h10 profiled systems replay, OCO-ACPO reaches 0.303 timely accuracy and 0.951 deadline satisfaction, with paired condition gains over RAMSIS/SneakPeek/utility-style comparators of $+0.00285$ timely accuracy ($p=0.0118$) and $+0.0146$ deadline satisfaction ($p=1.5{\times}10^{-5}$). SA-OCO-ACPO improves timely/deadline service by 0.188--0.417 over CPO under nonstationary stress. The claim is a systems scheduling result, not a broad LOB classifier leaderboard.
Robotics
THREAD: Trajectory Planning for Hybrid Rigid-Soft Manipulators with Environment-Aware Diffusion IROS 2026
Manipulation in confined environments, such as threading a manipulator through narrow apertures, remains a fundamental challenge, especially for conventional rigid robots. Hybrid rigid-soft manipulators offer promise but face two compounding planning challenges: backbone shapes feasible in free space become infeasible under environmental contact, and planning rigid and soft segments independently ignores their kinematic coupling. We present THREAD, the first diffusion-based trajectory planner for hybrid manipulation, learning a generative prior over physically realizable backbone trajectories conditioned on local environment geometry, with physics-inspired losses encoding curvature, smoothness, and collision constraints jointly across both segments. Trained in simulation, THREAD achieves 92.4% task success with 5x fewer collisions than the strongest baseline. We show cross-embodiment real-world transfer with minimal online updates, successfully threading through apertures as small as 1.3x the soft segment diameter.
comment: Project Page: https://robot-thread.github.io, IROS 2026
Rotation-Aware Point-Cloud Embeddings for Vision-Based In-Hand Reorientation
Point-cloud goals provide a direct way to specify dexterous in-hand reorientation: instead of defining an object-specific pose frame or estimating 6D pose at test time, the policy is given the desired 3D geometry of the object. Yet raw point-cloud goal conditioning is poorly conditioned for policy learning. Current and goal clouds are unordered, independently sampled, and often visibility-dependent, so their discrepancy entangles object rotation with permutation, resampling, and unstable correspondence structure. For this reason, prior point-cloud manipulation methods typically add structure outside the representation itself, such as explicit pose or relative-pose inputs, dense flow features, or distillation from privileged teachers. We close this gap by learning a rotation-aware point-cloud embedding whose Euclidean latent distance is calibrated to the SO(3) geodesic error between object orientations. The resulting representation turns current-goal comparison into a smooth control signal, allowing a model-free RL policy to act from current and goal point-cloud embeddings, proprioception, and centroid metadata, without object pose, relative pose, dense flow, or teacher-action supervision. In in-hand reorientation experiments, this interface matches privileged-state and distillation-based baselines while avoiding brittle test-time computation of structured pose or flow inputs. These results suggest that point-cloud goals become practical for this task when the representation, rather than an external module, encodes the task-relevant geometry of rotation. We also show evidence that generic visual point-cloud pretraining is insufficient for such a current-goal comparison because it discards the task-relevant state and preserves only shape features.
A Novel Bio-Inspired Fish Robot with Tunable Stiffness via Particle Jamming ICRA2026
Fish achieve efficient swimming across varied speeds through active modulation of their body flexibility. To explore the effects of tunable stiffness on swimming performance, we present a bio-inspired freely swimming fish robot with a rapidly tunable particle-jamming body. This design enables rapid stiffness adjustments with negligible changes in shape or volume, achieving a 54% variation in flexural rigidity across vacuum pressures of 0 to -40 kPa. We visualize the midline of the oscillating body under both low- and high-stiffness conditions, and the comparison confirms that the body curvature varies with stiffness. We further experimentally evaluate the tunable stiffness body's effects on swimming performance using velocity and cost of transport (CoT) measurements obtained via a motion tracking system. Results show that active stiffness tuning is essential for sustaining efficient and high-speed swimming across beating frequencies of 1-3 Hz. At low frequencies (1-1.5 Hz), a softer body (0 kPa) maximizes velocity and minimizes CoT, whereas at high frequencies (2.5-3 Hz), a stiffer body (-40 kPa) delivers superior velocity and reduced transport cost. These findings highlight stiffness modulation as a key strategy for adaptive and efficient propulsion in bio-inspired robotic swimmers.
comment: ICRA2026 accepted, best paper finalist
RoverDevKit: An open, physics-grounded tradespace toolkit for conceptual design of lunar micro-rovers
Pre-Phase-A design of lunar micro-rovers is dominated by tightly coupled mobility, power, thermal, and mass trades, yet conceptual-design tooling for the rapidly growing sub-50 kg class is typically proprietary, weakly benchmarked, or too slow to drive optimization. We contribute RoverDevKit, an open analytical evaluator coupling terramechanics, mass, power, thermal survival, and traverse that runs in 30ms per mission, fast enough to serve directly as a multi-objective optimizer's fitness function. Across mare, polar, highland, and crater-rim scenarios, NSGA-II Pareto fronts show that the binding design trade changes with mission profile within a single mass class: energy storage dominates at high latitude, slope traction on loose highland regolith, and traverse range on mare and crater-rim missions. Notably, rigid four-wheel layouts Pareto-dominate the full modeled mass range under smooth-regolith range-mass-slope objectives, contrary to the expectation that six-wheel architectures become optimal at heavier masses; six-wheel rocker-bogie layouts enter the Pareto set only once missions impose an obstacle-navigation requirement. The evaluator performance is benchmarked using both component and system checks: the terramechanics kernel matches measured single-wheel drawbar pull within the literature model-form band on two independent datasets, the bottom-up mass model predicts published in-class (5-50 kg) rover masses to 13.3% median absolute error, and a rediscovery check places real micro-rovers near the optimizer's fronts. Propagating the measured terramechanics error through the optimizer leaves the qualitative design rules unchanged. The tool, data, validation artifacts, and figure-generation scripts are released openly.
Programmable magnetic soft robots with controlled locomotion and directional liquid cargo release IROS 2026
Magnetically programmable soft elastomers enable complex shape morphing and locomotion dynamics in small scale soft robots under external magnetic fields. Benefiting from their programmed deformation and wireless actuation capabilities, magnetic soft robots have emerged as promising platforms for targeted drug delivery, especially in human gastrointestinal tract. However, achieving controlled directional liquid cargo release toward desired tissue interface while preserving the encoded shape morphing and locomotion capabilities remain a significant challenge. Here, we report a new design strategy that employs an optimized magnetization profile to enable controlled directional release of aqueous cargo without compromising shape morphing and locomotion capabilities. Magnetic soft robots with a specific spatially distributed magnetization profile allow directional alignment of the release interface with the orientation of the external magnetic field. This orientation control ensures active alignment of the release interface toward the intestinal wall prior to drug release. An interconnected microporous elastomer is embedded within the robot for aqueous cargo storage, while a thin microcrystalline wax layer seals the release opening hole to isolate the stored liquid cargo from external environment during transport. Triggered release is achieved by mechanically rupturing the wax sealing layer under a higher magnitude external magnetic field. Controlled directional flipping, locomotion, and triggered release are decoupled through external magnetic field's direction and strength. The controlled directional release strategy reported here integrates directional targeted liquid cargo release, shape morphing, and locomotion, which establishes the groundwork for target drug delivery in gastrointestinal tract applications.
comment: This manuscript has been accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
Shape-programmable Magnetic Soft Membranes for Mechanically Active Microchannels
The capability to encode spatially distinct magnetization patterns within soft materials enables remote control over complex deformations. This characteristic is especially important for microfluidic platforms, where limited dynamic control of channel boundaries and laminar flow conditions usually restrict fluid transport and interactions. The present study introduces a shape-programmable magnetic soft membrane actuator as an active microchannel component that can dynamically modulate its shape under magnetic fields and therefore the microfluidic environment. The membrane is magnetically programmed using a template-based approach, which allows it to be controllably deformed in the form of a sinusoid under the influence of an external magnetic field. The membrane's integration into a microchannel converts a passive channel wall into a dynamically changeable interface, allowing active fluid manipulation and enhancing micromixing in laminar flow conditions. The proposed approach establishes a versatile platform for wirelessly controlled deformable interfaces in next-generation microfluidic and lab-on-chip systems.
comment: Supplementary video can be requested from dakyildiz@wisc.edu
Imitation from Heterogeneous Demonstrations using Grounded Latent-Action World Models
Imitation learning has emerged as a powerful paradigm for learning visuomotor policies, but its generalisation and stability are limited by the scale and quality of demonstration data needed. A promising direction is to leverage more abundant but heterogeneous data sources, which differ in action space and often lack action labels altogether. Existing co-training approaches that combine heterogeneous data sources rely on heuristic and hand-engineered alignment techniques. In contrast, we argue that action representations should be grounded in prediction: actions that produce the same effect on the environment should share the same representation, regardless of their sources. To this end, we instantiate this principle by using a grounded latent-action world model (GLAM), a pair of generative models with a shared latent action space across data sources that is grounded by predicting future observations consistently across sources. This latent action space is used to train downstream behavioural cloning (BC) policies which map observations to latent actions and decode them back to robot actions, providing a paradigm for learning from heterogeneous data. Empirically, we demonstrate that GLAM successfully learns an aligned latent action space that facilitates action transfer across data sources with and without action labels. Across five manipulation tasks in simulation and in the real world, GLAM-aligned policies significantly outperform BC baselines and prior latent-action methods, achieving an average of +48% improvement in task success rate with the same data-scarce setting. Videos and code are available at https://viccccciv.github.io/glam/.
comment: 17 pages, 8 figures. Project page: https://viccccciv.github.io/glam/
Online Learning of Robust Legged Odometry with Minimal Exteroceptive Supervision
Robust locomotion and navigation for legged robots relies heavily on dependable odometry. Traditional multi-sensor fusion for such state estimation requires meticulous sensor calibration and platform-specific kinematic modeling, which complicates deployment. Industrially packaged exteroceptive sensors can provide accurate motion tracking but remain vulnerable to perceptually degraded conditions. We thus develop a plug-and-play, robust legged odometry system that eliminates the need for explicit exteroceptive-to-proprioceptive calibration or system kinematic modeling. Our approach leverages established exteroceptive motion pipelines as a continuous supervisory signal to train an online learned velocity neural network directly from proprioceptive data. An Invariant EKF (InEKF) is then used to fuse the learned proprioceptive or exteroceptive velocity (if any) and IMU data. When exteroception fails due to environmental degradation, the system seamlessly falls back to using the learned proprioceptive model, yielding a resilient legged odometry that readily adapts to new hardware. We demonstrate the platform-agnostic, easily deployable nature of our approach on different quadruped platforms, showcasing promising results in maintaining robust motion estimation across challenging scenarios.
Energy-based Compositional Diffusion Planning ICML 2026
Compositional diffusion planners aim to solve long-horizon robotic tasks using short training trajectories. Yet, current approaches often rely on the heuristic stitching of local predictions. We show that the resulting stitched update is generally a non-conservative field} that does not mathematically correspond to any valid global trajectory log-density function. We propose Energy-based Compositional Diffuser (ECD), a framework that formulates the global trajectory as the minimizer of the sum of local bridge potentials. This energy-based perspective defines a conservative correction field and contains a boundary reaction term that heuristic stitching omits. To enable efficient inference, we further introduce a Markov-based score approximation that computes the reaction term via a single block-tridiagonal solve, maintaining time complexity linear in the planning horizon. Empirically, ECD achieves state-of-the-art success rates on a range of OGBench stitching tasks, while nearly matching the inference speed of heuristic stitching methods. Code is available at https://github.com/GradientSpaces/ECD.
comment: ICML 2026
VQActFlow: Vector-Quantized Action Mode Steering for Multi-Task Robot Manipulation
Multi-task robot manipulation policies are challenging to learn from demonstration because traditionally a single network must select among qualitatively different action modes from a multimodal demonstration distribution, conditioned on language and visual context. A wrong mode selection means executing the wrong task or an action infeasible in the scene. Tokenizing continuous actions into a learned discrete codebook separates these modes at the representation level, offering structural advantages for multi-task learning. We propose VQActFlow, a multi-task manipulation policy that tokenizes action chunks and generates code sequences via Variational Flow Matching. VQActFlow maintains an explicit preference over action modes throughout generation. Inference-time guidance acts on this preference to steer mode commitment. We instantiate this with classifier-free guidance over language conditioning, which steers the policy toward the instructed action mode, and a learned codebook critic that supplies a complementary feasibility signal. We evaluate VQActFlow on three platforms: the LIBERO simulation benchmarks, a Unitree G1 humanoid performing whole-body pick-and-place, and an ALOHA-style bimanual platform performing contact-rich tasks. Across these benchmarks, VQActFlow outperforms both continuous and discrete baselines.
comment: 8 pages, 9 figures, 5 tables
Boundary-by-Mask: Few-Shot Instance Segmentation with Mask-Conditioned Boundary Learning for Texture-Poor Industrial Parts IROS 2026
Recent advances in large pre-trained models have led to remarkable progress in instance segmentation on general images. However, industrial scenarios remain challenging. Instance definitions are often application-specific and inconsistent, and the domain gap from general imagery is substantial due to weak textures and limited contextual cues. Consequently, a direct application of existing models is unreliable. We propose Boundary-by-Mask, a few-shot instance segmentation framework that supervises boundaries instead of interior appearance. Given a few RGB images and corresponding instance masks, the method extracts rich visual features using a foundation-model encoder and trains a lightweight Signed Distance Function (SDF) head to predict boundary-aware distance maps. Segmentation masks are obtained through an SDF-to-mask reconstruction process. By explicitly estimating contours, the framework achieves reliable instance separation even on low-texture and color-uniform surfaces. The instance definition is conditioned by the instance mask. Replacing the mask specifies the segmentation target, such as the whole object or a sub-part. A pixel-wise shallow MLP head enables rapid training. Experiments on industrial parts and food items with ambiguous boundaries show strong few-shot generalization, robustness in feature-poor conditions, and precise control over mask-level targets.
comment: 8 pages, 8 figures, accepted to IROS 2026
Conflict-Aware Switching for CBF-CLF-Based Multi-Goal Navigation
Quadratic programs (QPs) using Control Barrier Functions (CBFs) and Control Lyapunov Functions (CLFs) are widely used for safe control in reach-and-avoid navigation. However, the inherently conflicting nature of CBF and CLF constraints can lead to performance degradation, including slowdowns and deadlocks. This issue is exacerbated in multi-goal scenarios, where multiple nominal control objectives must be satisfied under shared safety constraints. Existing approaches for preemptive safety are often computationally expensive or overly conservative, while methods that relax or switch between nominal objectives are not well-suited for sequential goal-to-goal navigation. To address these limitations, we propose a conflict-aware switching strategy that detects high-conflict conditions and switches between available nominal control objectives to reduce constraint conflict. We apply this approach to multi-agent, multi-goal reach-and-avoid scenarios under CBF-CLF-QP control. Compared to a baseline sequential goal traversal strategy, our method reduces both completion time and timeout rates, demonstrating improved performance in satisfying all nominal control objectives while respecting safety constraints.
comment: CDC 2026 Submission Copy
Robot Critics that Sweat the Small Stuff
Large vision-language models contain several priors about the world and object interactions, making them useful critics during inference to steer robot policies towards success. However, closed-loop robot manipulation requires judging small visual differences between success and failure, which remains a challenge for current VLMs. We introduce a method to fine-tune critics by constructing pairwise progress supervision using success and failure rollouts obtained from a policy. Our fine-tuned critic excels at fine-grained progress reasoning and subtle failure detection, outperforming prior progress reasoning baselines. Additionally, we use an action-conditioned video model to predict the visual effect of several candidate actions sampled from a policy, and show that our critic can correctly identify successful candidates to execute, improving the average policy success rate by 11% across real-world tasks and 5.9% across simulation tasks.
LOGOS: LiDAR-Only Gaussian Elevation Splatting for Unified Tiny Obstacle Segmentation
Robust obstacle segmentation is essential for the safety of intelligent robots, where LiDAR-based perception systems play a fundamental role in the robot-environment interaction. While extensive LiDAR-based approaches have demonstrated high performance on common obstacles in urban scenarios, their results on tiny obstacles such as curbs, gravel, and potholes remain unsatisfactory due to the significant similarity between tiny obstacles and inherent road undulations. Moreover, their segmentation accuracy even deteriorates sharply when the LiDAR scans suffer from degradation in challenging off-road scenes. To overcome these bottlenecks, we propose LOGOS, a LiDAR-only unified tiny obstacle segmentation system, which models the road surface as a continuous mixture of 2D Gaussian primitives and distinguishes tiny obstacles via high-presicion elevation estimation. Unlike existing Gaussian splatting methods that rely on iterative RGB training, LOGOS is a backpropagation-free LiDAR-only approach. It directly estimates Gaussian parameters via a freespace-aware initialization by incrementally pruning non-road primitives using smoothness constraints. Subsequently, pointwise signed distances are computed via a novel normal-aware elevation splatting function, ensuring robustness to both flat and sloped terrains. We evaluate LOGOS on a highly heterogeneous benchmark of point cloud frames collected from urban mobility scenarios and mining haulage off-road environments. These data are practically acquired using different LiDAR sensors and exhibit large variations in point density, terrain roughness, and obstacle types. Experiments on the road and off-road scenes demonstrate that LOGOS significantly outperforms other state-of-the-art methods, particularly in degraded point cloud regions and challenging off-road scenarios, while maintaining real-time efficiency.
A Stitch in Time Saves Nine: Preserving Policy Compatibility Under Perception Updates in End-to-End Autonomous Driving
End-to-end autonomous driving systems tightly couple perception and decision-making through latent representations. Consequently, updates to perception models can alter these representations and degrade the performance of downstream policies that remain fixed. Existing solutions typically rely on policy retraining or architectural decoupling, both of which incur substantial computation and validation costs. In this paper, we formulate the model stitching problem for end-to-end autonomous driving and test the hypothesis that policy compatibility can be preserved through lightweight latent-space alignment. We study low-complexity model stitching methods, including linear and convolutional stitchers, for restoring compatibility between updated perception modules and frozen downstream policy modules. Experiments demonstrate that stitching effectively preserves downstream driving behavior under diverse perception updates, including changes in random initialization, sensor configuration, and training domain. In the most challenging cross-domain setting from nuScenes to CARLA, convolutional stitching retains over 91\% of the no-shift driving score while reducing adaptation time from \SI{22.18}{h} to \SI{0.91}{h}. These results suggest that model stitching provides an effective and computationally efficient alternative to retraining or fine-tuning for maintaining end-to-end autonomous driving systems. The model will be open-sourced upon paper acceptance at https://github.com/SCP-CN-001/model-stitching to support further research and development in autonomous driving.
comment: 12 pages, 9 tables, 5 figures; T-ITS under review
UniviewVLA: A Unified Multiview Vision-Language-Action Model with World Modeling
Occluded tasks remain a bottleneck in robot manipulation. Existing solutions either deploy additional physical cameras requiring training-inference camera parity, or rely on explicit 3D reconstruction with high computational cost. Moreover, both approaches rely on standard agent-view and wrist-view observations, while failing to capture occlusion information and future scene evolution. To this end, we propose UniviewVLA, a unified multiview Vision-Language-Action model with world modeling, which infers multiview scene evolution for action prediction from only standard two-camera observations. We demonstrate that by leveraging generated multiview future views from the world model, UniviewVLA reveals occluded cues and models future scene evolution, improving action prediction and removing the need for extra hardware or explicit reconstruction. Besides, to accelerate inference while preserving prediction accuracy, UniviewVLA develops Motion-Informative Token Compression, which compresses each generated view from 625 to 16 tokens and reduces per-view latency from 6-7s to 0.2-0.3s. UniviewVLA also proposes training-free Action-Entropy View Selection, which dynamically identifies the most action-informative view at different inference stages. Extensive experiments show that UniviewVLA achieves 95.8% on LIBERO and 4.60 on CALVIN ABCD to D, both standard occlusion-free benchmarks. On customized occlusion-focused tasks, it improves success rate from 40.0% to 73.3%, and average real-robot success rate by 33.4 points, demonstrating stronger occlusion-focused performance without sacrificing standard occlusion-free benchmarks.
Decoupling the Declarative from the Procedural in Vision-Language-Action Models
Deploying generalist robotic agents in the real world requires transferable skills. Specifically, a policy trained to clone a behavior from object-specific demonstrations must generalize beyond that object, otherwise data collection requirements become intractable. Recently, fine-tuning of pre-trained billion-parameter Vision-Language Models (VLMs), initially on large-scale robot datasets and then on fewer scenario-specific demonstrations, has emerged as the predominant paradigm for designing Vision-Language-Action (VLA) models. While these policies achieve state-of-the-art manipulation performance in-distribution, they remain brittle to minor spatial, semantic, and task variations. In this work, we address the inability of current models to decouple the declarative (i.e., concepts and entity semantics) from the procedural knowledge (i.e., how to do something) encoded in their parameters, which is a fundamental bottleneck for zero-shot skill transfer to novel objects. To address this, we propose w$^{2}$VLA, a new VLA model with restructured information flow. Rather than feeding all multimodal tokens from the VLM encoder into a large, opaque transformer-based action expert, our approach modulates the robot state sequence with visual, spatial, and skill information in a compositional and interpretable manner. Unlike popular, state-of-the-art VLAs, we show that our modular approach successfully decouples knowledge representations, enabling robust behavior cloning and unprecedented zero-shot skill transfer capabilities across dissimilar, unseen objects.
ASCII Art Turns LLMs into VLA Controllers
Vision--Language--Action (VLA) controllers are often built by extending vision--language models (VLMs) with action supervision, relying on multimodal backbones with large data and compute requirements. We demonstrate that a text-only large language model (LLM) can be adapted into a VLA-style controller when visual observations are rendered into a text input using an ASCII representation. This ASCII-as-vision interface enables existing training and deployment stacks for LLMs to efficiently condition on visual state, follow natural-language instructions, and produce constrained, executable actions. We fine-tune and compare multiple LLMs and VLMs across model families and scales, using both expert demonstrations from a planning-based teacher, as well as DAgger for iterative improvement. In a 2D manipulation benchmark, in both simulation and on a physical manipulator, the resulting controllers can identify task-relevant entities and plan feasible action sequences. Our results suggest that ASCII rendering can serve as a lightweight, interpretable modality bridge from images to text, complementing conventional VLA pipelines, and opening directions for VLA research with text-only backbones.
Manipulider: A Multi-Engine Buoyancy-Controlled Robot for Thrusterless Underwater Gliding and Manipulation
The Manipulider is a buoyancy-actuated underwater robot that enables thrusterless, glide-like locomotion and attitude-based manipulation, while providing a magnetic modular interface for rapid payload swapping (e.g., a gripper or sensors). Four syringe-based buoyancy engines distributed around the body jointly regulate net buoyancy and the center of buoyancy, allowing the vehicle to maintain large tilt angles through static force balance without continuous thrust and to avoid propeller entanglement risks. We present the mechanical and electrical design, calibration procedure, and control architecture. Experiments with a gripper attached (no external payload) show a controllable buoyancy-displacement range of 40 mL per engine ({\approx}160 g total buoyancy authority), maximum statically stable tilts of 64.6° (single-engine) and 61.8° (dual-engine), and representative vertical and tilt-transition dynamics. We further demonstrate tilt regulation, controlled ascent/descent primitives, and a proof-of-concept gripper-based payload-transport sequence without thrusters.
Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Exploring Query-Based Segmentation and Increased Spatial Context for Outdoor Scene Understanding ICRA 2026
In this report, we present our submission to the GOOSE 2D Fine-Grained Semantic Segmentation Challenge, organized as part of the Workshop on Field Robotics at ICRA 2026. The challenge combines data from the GOOSE and GOOSE-Ex datasets, which comprise more than 13k images captured from 4 distinct camera setups, annotated using a hierarchical taxonomy of 56 fine-grained classes and 11 broader categories. Starting from SegFormer as a baseline, we progressively improve segmentation performance through increased training crop sizes, a transition to the query-based Mask2Former architecture, and test-time augmentation. Our experiments show that query-based segmentation significantly outperforms the baseline model. Furthermore, increasing the crop size used during training yields substantial gains, highlighting the relevance of preserving scene context for fine-grained semantic disambiguation. Our final submission, using test-time augmentation, achieves an mIoU of 69.6% on the challenge test set, providing a strong baseline for fine-grained semantic segmentation in outdoor environments. To facilitate reproducibility and future research, code and weights will be made publicly available at https://github.com/RoboticsLabURJC/outdoor-fine-grained-segmentation .
comment: Ranked 5th in the GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the IEEE ICRA 2026 Workshop on Field Robotics
Temporal logics and formal synthesis for robot planning and control
As robots move from controlled environments into real-world settings, it becomes increasingly crucial to ensure that they perform as expected. A key step toward that goal is a rigorous specification of the desired robot behavior, capturing intricate temporal, spatial, and logical requirements. Complementing this, plan and control synthesis methods are needed to fulfill these specifications with provable guarantees. This manuscript presents temporal logics - particularly linear and signal temporal logic - as expressive specification languages for robot behavior over time. We then discuss principles of formal synthesis, from discrete graph- and game-based approaches to sampling-based motion planning, trajectory optimization, and control-certificate-based synthesis. Finally, we outline challenges in deploying formal synthesis in real-world robotics, emphasizing the interplay between modeling fidelity, computational tractability, and the types of rigorous guarantees that can be achieved.
Robot Self-Improvement via Human-Video Dynamics Models
A central question in robot learning is how to acquire skills from the kinds of data that humans learn from: passive observation, embodied practice, and the experience of failure. Human videos provide the first of these in abundance, and prior work has shown they can initialize useful policies. Far less clear is whether they can support the second and third: whether priors extracted from human videos can ground a robot's own attempts well enough to evaluate them, correct them, and improve from them. In this work, we show that human videos can be used to learn embodiment-agnostic action, dynamics, and value representations that transfer across robot embodiments, providing the predictive foundation required for robots to autonomously improve from their own rollouts and failures. We introduce Dynamics-Guided Action Correction (DGAC), a training-free approach that leverages these adapted models to repair failed states: each failure becomes a query for which the learned models propose and rank corrective actions, turning failures into supervision for the next policy update. Across seven real-world manipulation tasks spanning both a mobile manipulator and a static manipulator arm, our approach improves success rates from 40% to 81% across multiple policy backbones, demonstrating cross-embodiment robot self-improvement from human-video priors. These results show that human priors and robot failures can be combined to enable scalable autonomous policy improvement. Project page: https://ethz-mrl.github.io/robot-self-improvement-website/.
BIT-Nav: Brain-Inspired Trajectory Memory for Embodied Navigation CVPR 2026
Vision-Language Models (VLMs) for embodied navigation rely on selecting a fixed number of frames from a growing trajectory history. As episodes extend, this selection grows increasingly sparse, yet prior work shows no accuracy gain when scaling from 8 to 64 frames, suggesting the bottleneck is not frame quantity but the representation itself. Sparse frame selection cannot capture the structured behavioral signal that long-horizon reasoning requires: turning patterns, cumulative displacement, and path topology. We introduce BIT-Nav (Brain-Inspired Trajectory Memory for Navigation), a framework that augments frozen VLM navigation pipelines with a compact learned trajectory memory. Motivated by hippocampal path integration, where spatial experience is compressed into structured episodic traces rather than stored as raw sensory replay, BIT-Nav trains a Bi-GRU encoder over action and relative pose sequences via a multi-positive InfoNCE contrastive objective on trajectory prefixes sharing the same behavioral intent. The resulting embedding is projected into the VLM token space via a lightweight MLP and injected as a single memory token at each decision step, conditioning the model on structured motion history at constant token cost regardless of episode length
comment: Accepted to CVPR 2026: 2nd Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments
Overcoming Imperfect Kinematics in Surgical Robotics Through Sim-to-Real Visuomotor Learning ICRA
Robot-Assisted Surgery is integral to modern minimally invasive procedures, with automation emerging as the next frontier to enhance precision and reduce surgeon fatigue. This evolution is largely impeded by the inherent kinematic inaccuracies of surgical robots, where unreliable internal sensors lead to significant control errors. While previous methods attempted to mitigate these issues through complex model-based calibration, they often suffer from high cost and limited effectiveness. This work utilises a learning-policy to actively compensate for hardware inaccuracies using closed-loop visual feedback that was trained from a teacher-student learning framework. The policy can fuse unreliable internal readings with precise external visual data, allowing it to correct for kinematic errors in real time without needing a perfect physical model. The learned policy was successfully deployed on the da Vinci Research Kit, where experiments validated the fundamental feasibility of using external vision to overcome internal sensor deficits. This research provides a foundational and reliable control methodology, paving the way for more advanced and robust surgical automation.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Long-Distance Real-World Navigation of the Legged-Wheeled Robot Go2-W Using Deep Reinforcement Learning
Legged-wheeled robots have long been studied for their potential to combine the efficient flat-ground mobility of wheels with the rough-terrain capability of legs. However, examples of their application to long-range autonomous navigation in real environments remain limited. This paper reports our effort to build a deep reinforcement learning (DRL) based locomotion controller and an autonomous navigation system for the commercially available legged-wheeled robot Go2-W, and to apply them to long-range autonomous navigation in a real environment. For locomotion control, we extended a proprioception-only policy, which we had previously developed for quadruped robots, to the 16-DoF legged-wheeled robot. We also found that wheeled locomotion concentrates the load on the hip joints and causes heat concentration that hinders sustained travel, and obtained a policy that suppresses it by distributing the load. We evaluated the system at the Tsukuba Challenge 2025, demonstrating that it can autonomously traverse an approximately 2.8 km route including sidewalks, a park, and stairs without stopping due to overheating.
FLM-Occ: Feed-forward Likelihood Maximization for Efficient Indoor Occupancy Prediction ECCV 2026
Recent indoor occupancy prediction methods adopt Gaussian primitives as a sparse 3D representation for computational efficiency. However, their training relies on voxel classification, which imposes only local constraints and lacks global supervision on the distribution of the primitives. Therefore, they inevitably predict spurious primitives in empty regions, undermining both representational and computational efficiency. To address this, we propose Feed-forward Likelihood Maximization (FLM), a novel framework that reformulates occupancy prediction as voxel distribution estimation. In FLM, a network is trained to predict a mixture model that maximizes the likelihood over ground-truth occupied voxels in a feed-forward manner. To enable end-to-end training of networks and voxelization of a standard mixture model, we define mixture weights as normalized primitive volumes to implicitly enforce simplex constraints and derive novel voxelization formulas. Based on FLM, our FLM-Occ, a novel method that is capable of relocating randomly initialized primitives over long distances to model a scene. On Occ-ScanNet, FLM-Occ achieves superior accuracy using only 32 superquadrics, 2.7% of the prior SoTA, while running 3.7 times faster.
comment: Accepted to ECCV 2026
NAC: Neural Action Codec for Vision-Language-Action Models
Vision-language-action (VLA) models rely on discrete action tokenizers to bridge continuous robot control and autoregressive sequence modeling, yet existing tokenizers often trade off between compression, latency, and downstream performance. We revisit this design through the lens of neural audio codecs-convolutional encoder-decoder architectures with residual vector quantization that serve as the standard front end for audio foundation models. Motivated by their success, we introduce the Neural Action Codec (NAC), which treats short robot action trajectories as multi-channel 1D signals and compresses them using a multi-scale RVQGAN architecture. We observe that audio-specific mel-spectrogram objectives are ill-suited for kinematic signals; however, by replacing them with simple time-domain and non-mel spectral reconstruction losses, audio-codec-style models can autoencode actions with high fidelity without substantial architectural changes. NAC provides a compact, ordered token space via offset codebooks, enabling standard autoregressive policies to operate over short, structured sequences. Meanwhile, a Vocos-style decoder with an ISTFT head and adversarial discriminators recovers smooth, detailed trajectories. Across LIBERO-10, RoboMimic, and a suite of real-world manipulation tasks, NAC achieves lower reconstruction error and higher success rates than binning, FAST, and prior VQ-based tokenizers at comparable or better compression rates. These results demonstrate that repurposed neural audio codecs offer a strong, practical backbone for learned action tokenization in modern VLAs.
Mind the Noise: Sensitivity of Transformer-based Interaction-Aware Trajectory Prediction Models to Noisy Data
Trajectory prediction allows autonomous vehicles to anticipate the future behavior of surrounding objects (or agents) and, accordingly, maximize the safety and efficiency of their driving. State-of-the-art Transformed-based interaction-aware trajectory prediction models, which rely on attention mechanisms to capture multi-agent interactions and maximize prediction accuracy, are commonly trained and evaluated on long-range high-quality datasets. These datasets are typically obtained by aggregating data from multiple vehicles or drones and removing any object detection or tracking noise offline. Yet, information about a surrounding object's state (its position, speed, heading) is far from being noiseless in real-world deployments. Object state estimation is affected by perception uncertainties and localization errors that can be particularly large for objects received via Vehicle-to-Everything (V2X) communications. In this paper, we analyze the impact of noisy object state information on the trajectory prediction accuracy of a state-of-the-art Transformer-based interaction-aware trajectory prediction model. Our study demonstrates that trajectory prediction accuracy can rapidly deteriorate as the noise intensity increases. Numerical results show that the prediction accuracy can reduce by a 1.3x factor under small noise levels and by as much as a 3.9x factor under the highest (yet realistic) noise conditions. These findings reveal the strong sensitivity of trajectory prediction models to noisy data, underscoring the need for more realistic training and evaluation datasets as well as noise mitigation strategies.
RogueRover: Autonomous Rogue Device Localization for Incident Response
Physically localizing unauthorized wireless devices remains a critical bottleneck in cyber-physical security operations, where rogue access points can provide entry points for lateral movement and persistent compromise. While such devices can often be detected through network-side mechanisms, determining their physical location typically requires dense sensing infrastructure, site-specific RF fingerprinting, or manual inspection, limiting timely incident response. We investigate whether a single commodity robot can autonomously detect and localize rogue wireless devices under zero-configuration constraints, without RF fingerprinting, pre-installed sensors, or site calibration. We present RogueRover, an end-to-end system in which a quadruped robot autonomously patrols, collects spatially labeled RSSI measurements via a standard 802.11 interface, and estimates device locations offline. We evaluate the system across 11 patrol runs in a real indoor environment, with 6 rogue devices deployed under heterogeneous propagation conditions. Across 62 AP-patrol sessions, RogueRover achieves a median single-patrol localization error of 1.62 m without prior RF knowledge. Under multi-run aggregation, five of six devices are localized within 1 m. A blind trial validates the full pipeline, correctly identifying rogue devices among 73 observed BSSIDs and localizing them with errors of 0.34 m and 1.84 m. Across environments, simple weighted-centroid estimators perform comparably to, or better than, parametric path-loss models, indicating that measurement coverage from autonomous patrols is the primary determinant of localization accuracy under zero-prior constraints. Our results demonstrate that infrastructure-free, autonomous localization is feasible in practice, enabling rapid physical incident response in cyber-physical environments without additional sensing infrastructure.
comment: 14 pages, 4 figures, 2 tables. Code and data are available in the accompanying artifact repository
Spectral GS-SLAM: Observability-Aware, Degeneracy-Robust Tracking for Real-Time 3D Gaussian Splatting SLAM IROS 2026
Recent 3DGS-SLAM systems enable real-time operation by leveraging conventional feature matching or ICP-based tracking, thereby avoiding the heavy dense photometric optimization used in earlier approaches. However, feature matching remains prone to failure in textureless environments, while ICP-based tracking struggles in structureless or geometrically degenerate scenes due to ill-conditioned optimization. To address this issue, we propose Spectral GS-SLAM, an efficient yet robust tracking framework that integrates ICP with complementary feature-based constraints. Our method mitigates numerical instability by adaptively compensating under-constrained directions in degenerate scenarios, without interfering with the shared Gaussian representation used for mapping. We further introduce a Gaussian-aware planarity weighting mechanism that exploits the intrinsic covariance structure of 3D Gaussians to characterize scene geometry and guide information fusion. Extensive evaluations on challenging TUM RGB-D sequences demonstrate that Spectral GS-SLAM achieves real-time performance (40.14 FPS) while maintaining consistent tracking in both structureless and featureless environments. The proposed method preserves trajectory integrity in degenerate scenes while maintaining competitive performance in non-adverse conditions.
comment: This work has been accepted to IROS 2026
A Human-Inspired Thumb-Index Robotic Hand with Strain Gauges Embedded in Soft Joints
Human hand grasp adaptation depends mainly on the synergy between physical structure and biological feedback. Inspired by this biomechanical principle, the Safe Thumb-Index Robotic (STIR) Hand was developed as a minimal, lightweight, and low-cost two-digit prototype featuring an asymmetric thumb-index configuration. By pairing an underactuated, tendon-driven mechanical design with flexible strain gauges embedded into silicone-encapsulated soft joints, the system achieves passive grasp adaptation while establishing both internal proprioception and external perception. Unsupervised analysis was carried out on a dataset of the STIR hand grasping 20 different objects, along with an object classification task and an ablation study to highlight the contribution of the soft joint sensors. The object classification task discriminated object size, shape, and material stiffness with a high classification accuracy. In contrast to traditional industrial grippers and robotic hands, the STIR Hand demonstrates that sensorized compliant joints significantly improve overall sensitivity and ensure safe grasping, while remaining independent of additional fingertip tactile elements or external vision systems. Finally, a comparison to similar devices grasping identical objects validates the utility of the STIR Hand.
comment: 8 pages, 7 figures
Ultra-Fusion: A Resilient Tightly-Coupled Multi-Sensor Fusion SLAM Framework under Sensor Degradation and Spatiotemporal Perturbation for Intelligent Transportation Systems
Reliable localization is essential for intelligent transportation systems (ITS), including autonomous vehicles, quadruped last-mile carriers, and infrastructure-inspection unmanned aerial vehicles (UAVs). Although tightly-coupled multi-sensor fusion improves accuracy in favorable conditions, deployed systems remain vulnerable to sensor degradation -- poor illumination, LiDAR degeneracy, wheel slippage, and GNSS outage -- and to spatiotemporal calibration errors. These failures are common in urban canyons, tunnels, and high-speed corridors, where localization drift can degrade route tracking, tunnel passage continuity, and local map alignment. This paper presents Ultra-Fusion, a tightly-coupled multi-sensor localization framework based on a unified sliding-window estimator. Asynchronous measurements are timestamp-ordered and converted into optional factors within one optimization window, supporting WIO, VIO, LIO, and LVIO with optional wheel and GNSS augmentation. Observability-aware initialization selects the bootstrap mode, factor-wise reliability scheduling gates degraded measurements, and online LiDAR--IMU spatiotemporal calibration refines temporal offsets and rotational extrinsics during operation. We extend the M3DGR benchmark with simulation trajectories and evaluate more than 60 open-source SLAM systems on M3DGR, M2DGR-Plus, KAIST, GrandTour, and MARS-LVIG. The results show competitive accuracy across wheeled, legged, and aerial platforms under long-duration and high-speed operation, degradation, and calibration perturbation, improving localization availability for road-level autonomy, campus and warehouse mobility, and low-altitude aerial inspection. To benefit the industrial and academic community, we will release source code and datasets upon paper acceptance.
comment: 19 pages
FleetAgent: Teleoperation Assistant for Autonomous Fleets via Vectorized V2N Messages
Large-scale autonomous fleets rely on teleoperation to resolve rare failures, yet streaming raw sensor data from many vehicles is costly, and remote operators can only monitor a limited number of vehicles at a time. We introduce FleetAgent, a cloud-hosted multimodal large language model (MLLM) assistant that consumes compact vectorized vehicle-to-network (V2N) messages, such as map elements, detected objects, and the ego planned path. It provides a structured natural-language response (including narration, explanation, and evaluation of the plan and scene), along with an intervention urgency score for operator prioritization. To make structured messages compatible with token-based MLLMs, we propose VecFormer, a vector-to-embedding interface with differentiable top-K context selection that bounds context length and GPU KV-cache growth, enabling more efficient batch processing, which is important under the context of cloud-hosted large-scale fleet management. We also construct VecEval, a nuScenes-derived dataset with paired human and synthetic imperfect plans and human-verified language labels, to facilitate the training and evaluation of our proposed system. Our proposed system can reduce uplink payload by up to 625 times compared with raw images and reduce KV-cache memory by 16.54 times compared with original text descriptions. On VecEval, FleetAgent improves Lingo-Judge score by 16.8% and reduces intervention failure rate by 19.9%, compared with Qwen2.5-VL-7B using language descriptions. These results demonstrate that FleetAgent can utilize compact structured V2N messaging to enable efficient, explainable teleoperation monitoring for autonomous fleets.
A UAV-Mounted Sensor Network for Close-Range Inspection of Wind Turbine Rotor Blades
Inspection of offshore wind turbine rotor blades is critical for predictive maintenance to maximise efficiency and extend operational lifetime. However, it remains a challenging task due to remote locations, large structural dimensions, and the limitations of current UAV-compatible sensor systems. While existing approaches can detect certain types of surface anomalies, reliable classification of defect types often remains a manual and error-prone process. This paper presents the design of a UAV-mounted multimodal sensor network combining an industrial RGB camera, a passive thermal infrared camera, and an in-house developed 3D scanner. All sensors are co-calibrated into a common coordinate frame, enabling spatial superimposition of geometric, colour, and thermal data. The system is designed to operate at close range, addressing three fundamental sensing challenges: platform motion, large field of view, and millimetre-level measurement accuracy. Preliminary laboratory results demonstrate synchronised multi-sensor acquisition and initial point cloud reconstructions, forming the basis for future airborne inspection trials.
comment: The Aerial Inspection for Marine Infrastructures (AIMI) workshop
A scalar per patch from pre-trained ViTs enables fast moving navigation in the real world ECCV 2026
Trained policies for real-world robotics rely on computer vision components, typically in the form of pre-trained visual encoders. These encoders are an essential component and it has been shown that their power does not emerge from training on robotics downstream losses alone. Pre-training with auxiliary losses in the form of computer-vision pre-text tasks is a defining factor and heavily conditions agent performance in robotics tasks. In this unprecedented large-scale study, we ran 966 navigation episodes of static point goal navigation in a real-world building for 24km and asked which components really matter for the computer vision aspects of robotics: we evaluate state-of-the art visual encoders in realistic conditions. We explore the usefulness of heterogeneous multi-teacher distillation leading to encoders with multiple different and complementary skills. We investigate how much information from these encoders is necessary for robotics by bottlenecking them in a principled and spatially useful way and we show that this leads to the emergence of interpretable features linked to affordances. We also argue that training policies on RGB data alone does not lead to an optimal usage of visual features and show this by finetuning policies pre-trained on privileged information. All in all, we paint a more complete picture of what aspects of computer vision are relevant for real-world navigation.
comment: European Conference on Computer Vision -- ECCV 2026
Discrete Geometric Modeling and Extended State Estimation of Continuum Robots
In this paper, we present a fully discrete approach for the accurate and numerically efficient dynamical modeling and state estimation of continuum robots. The model is based on geometrically exact beams in a minimal, strain-based formulation and derived in the framework of Lie group variational integrators, allowing to preserve important geometric properties that we exploit to achieve high accuracy and numerical efficiency. We then propose a disturbance observer based on an extended Kalman filter formulation that reliably estimates system states as well as model uncertainties and external disturbances. Experiments on a real system validate the accuracy and efficiency of the proposed model and observer.
comment: ©2026 Maximilian Herrmann, Leander Pfeiffer, Paul Kotyczka. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-ND
Remember what you did?: Learning Behavioral Memories for Partially Observable Object Manipulation
Long horizon, contact-rich manipulation is inherently partially observable. This is as a single visual observation rarely captures a robot's full action context, including prior attempts, interactions, or progress. Consequently, standard visuomotor policies or vision-language-action models are prone to struggle in such tasks due to a lack of memory. To address this, we introduce Compressed Action Memory Policy (CAMP) based on the insight that a robot's own action history serves as a highly informative, self-supervised signal, enabling the policy to learn a robust, compact history representation. In our approach, we train a memory module to maintain a compressed representation of past actions, forcing it to encode a latent behavioral memory of all the robot's past interactions that can then be used to better contextualize future actions. This allows our approach to implicitly track generalized task progress and learn from failed attempts without any additional supervision, or external oversight. We evaluate CAMP across four real-robot setups and two novel simulation benchmarks: Memory-T-Bench and Memory-Manip-Bench. By demonstrating substantial gains over state-of-the-art baselines, CAMP is, to our knowledge, the first policy to demonstrate substantial success on contact-rich partially observable manipulation tasks purely through learned memory.
comment: Project website: robo-camp.github.io
OmniV2X: A Generative Foundation Planner for Efficient End-to-End Cooperative Driving
We present OmniV2X, a generative foundation model for vehicle-to-everything (V2X) cooperative driving. The model directly interprets independent context sequences comprising multi-modal and multi-agent observations. The new design mitigates the computational cost of dense 3D perception, the vulnerability to data scarcity in cooperative scenarios, and the poor compliance with standardized messaging in existing methods that fuse multi-modal inputs into a shared representation. For training, we present an end-to-end supervised pipeline using a downstream trajectory generation loss, in which a high-capacity generative sequence planner implicitly learns to steer the model and leverage multi-modal inputs via cross-attention injection. As a foundation model, we demonstrate that OmniV2X pre-trained on large-scale single-agent planning datasets can efficiently adapt to cooperative environments by integrating the conditioning context with lightweight, standard-compliant V2X tokens. Evaluated on the DAIR-V2X-Seq dataset, OmniV2X outperforms existing end-to-end cooperative driving baselines, achieving state-of-the-art performance with less than 10% of the fine-tune V2X dataset and less than 1% of the communication bandwidth. We conduct comprehensive evaluations to demonstrate its computational efficiency and robustness under real-world constraints.
comment: 8 pages, 5 figures
DPLAN: Minimal Connectivity to Floorplan Generation
Automated floor plan generation is an important problem in computational architectural design. The goal is to construct a floor plan from user-defined room numbers and door requirements. The user specifies which rooms must share a door and which rooms must not be adjacent. However, these requirements do not determine the exact placement or shape of the rooms. The task is therefore to arrange the rooms in a single floor plan so that all required door connections are satisfied and no rooms overlap. To address this problem, we propose DPLAN (Door Connectivity to Floor Plan Generation), a graph-based prototype that generates floor plans from door and non-adjacency constraints. The framework operates in three stages. First, the user-defined graph is examined and, if disconnected, additional edges are added to connect its components. Second, a bi-connected plane triangulation is constructed to ensure the existence of a floor plan without overlapping rooms or empty spaces. Third, the triangulated graph is transformed into floor plans. For rectangular floor plans (RFPs), separating triangles are removed by modifying edges without adding new vertices, thereby avoiding the creation of extra rooms. For orthogonal floor plans (OFPs), separating triangles are removed by introducing additional vertices, allowing rectilinear room shapes. By enforcing both door and non-adjacency requirements, the framework generates floor plans that satisfy the given constraints. The method is implemented in Python and includes a prototype for interactive constraint specification and floor plan visualization. Currently, the framework supports rectangular plot boundaries. Future work includes support for non-rectangular plots, dimension-based scaling, and circulation modeling.
Pose-Agnostic Robotic Functional Grasping via Observation-Action Canonicalization
Functional robotic grasping requires a policy that generalizes across diverse object geometries and poses while maintaining task-specific contact precision. We study this challenge through mug-handle grasping, where thin handles, instance variation, and upright or inverted placements make both perception and control sensitive to object configuration. Grasp pose detection methods operate open-loop and are sensitive to estimation errors on thin handle structures. Learned visuomotor policies must implicitly learn to handle the coupled variation in visual appearance and action direction induced by different object placements, limiting generalization. We propose AnyMug, a canonicalized visuomotor reinforcement learning framework for functional grasping that trains a single closed-loop policy entirely in simulation and deploys it zero-shot on a real robot. AnyMug introduces observation-action canonicalization, which transforms both the depth observation and the predicted end-effector action into a shared object-centric frame. The policy therefore sees a consistent mug-centered view and emits actions in a canonical direction regardless of mug placement, allowing the same grasping behavior to be reused across configurations. A handle-aware reward further encourages precise approach, gripper alignment, and opposing-finger placement, while a pose curriculum and domain randomization improve training stability and sim-to-real transfer. In simulation, AnyMug achieves over 93% success rate on both unseen upright and inverted mugs and transfers zero-shot to a real Franka Panda, reaching 80% success rate on 5 held-out physical mugs across both pose categories.
PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning
Latent action pretraining learns representations of visual change from pairs of observations, but existing methods typically encode each transition as a single unstructured representation that entangles transition extent and transition mode. We introduce Polar Latent Actions with Radial structure (PoLAR), which imposes a radial-direction structure on latent actions, encouraging radius to encode transition extent and direction to retain transition mode. PoLAR uses temporal offset between two observations as a weak proxy for transition extent, encouraging latent action from observation pairs separated by larger temporal gaps to occupy larger radii. We instantiate this structure in hyperbolic space, whose expanding volume with radius offers a natural fit for more diverse transition modes at larger extents. Across in-task and large-scale pretraining settings, PoLAR improves downstream policy performance in simulation and real-world robot experiments, outperforming latent action baselines and strong pretrained VLAs. These results suggest that the geometry of the latent action space is an important design choice for transferring visual pretraining to downstream robot policy learning.
comment: Project Page: https://joon-stack.github.io/PoLAR
Horizon Adaptive Offline Policy Learning via Value Stitching
Learning accurate value functions plays a decisive role for reinforcement learning (RL) agents to solve long-horizon, complex tasks. Conventional temporal-difference (TD) learning objectives suffer from value-estimation bias that accumulates over the horizon, while extended-horizon modeling methods, such as n-step TD backups and Q-chunking, adopt a rigid, fixed-horizon value-modeling recipe that is often not flexible enough to capture complex value structures in long-horizon, multi-stage tasks. In this paper, we show that enabling value updates with dynamic horizon composition can yield a strong offline policy learning scheme. Our method, Horizon Adaptive Offline Policy Learning via VAlue STitching (VAST), replaces fixed-horizon backups with recursive, horizon-adaptive value composition. Its key ingredient is to couple value optimization with a future state- and horizon-length-conditioned auxiliary value function that is learned through direct data supervision, and a stitching policy that optimally selects the reward-maximizing horizon length and future sub-goal to achieve horizon-adaptive value stitching. This design enables direct estimation and compositional "stitching" of variable-length returns grounded in actionable sub-goal states, providing an accurate and greedily exploitable value-supervision signal for offline policy optimization. Across 50 tasks on OGBench, VAST outperforms fixed-step, extended-horizon methods, and generative-value offline RL baselines, achieving strong performance particularly in high-complexity, long-horizon decision-making tasks.
Odoriko: A Shape-Aware Multimodal Diffusion Framework for Human Motion ECCV 2026
Human motion generation has been widely studied across diverse input modalities, text, music, and video, and recent efforts have unified these into single multimodal frameworks. However, while morphological factors such as gender and body shape are known to produce distinct kinematic signatures, no existing unified framework incorporates this into generation, treating all subjects as morphologically equivalent. We present Odoriko, the first unified multimodal motion generation framework that reflects subject bio-morphological information directly in synthesized motion output. Rather than averaging over subject variation, Odoriko generates motion that is consistent with who is moving, not just what they are asked to do, across text, music, and video conditions within a single model. When explicit morphological information is unavailable, Odoriko additionally recovers subject morphology alongside motion, unifying estimation and generation in one framework. Extensive experiments across text-to-motion, music-to-dance, and video-to-motion benchmarks demonstrate that Odoriko matches or exceeds prior specialized models on standard metrics, while enabling morphology-consistent generation that no existing unified framework supports.
comment: ECCV 2026
Factor-Aware Mixture-of-Experts with Pretrained Encoder for Combinatorial Generalization IROS 2026
The integration of pretrained encoders with diffusion policies has become a dominant paradigm for visual robotic manipulation. However, it still struggles to generalize across complex environments with varying factors such as lighting and surface textures. To address this, we propose FAME, a framework that integrates a factor-aware mixture-of-experts (MoE) with a pretrained encoder to enhance generalization to environmental variations. FAME follows a three-stage training process: (1) policy warmup, where a diffusion policy is trained on standard-environment data with a frozen encoder; (2) factor-specific adapter training, where lightweight adapters inserted between the frozen encoder and the temporarily frozen policy are trained on customized datasets, each targeting a distinct environmental variation; and (3) joint fine-tuning, where a central router and the warmed policy are trained on mixed data to handle multiple factors jointly. FAME is ``factor-aware'' because the central router softly weights frozen factor-specific adapters as a dense MoE, enabling combinatorial generalization across multiple factors. Evaluations on the Meta-World benchmark show that FAME outperforms diffusion policy baselines by 34%. We further validate FAME in a real-world pick-and-place task using a compact model trained on newly collected data, where FAME achieves a 35% improvement in generalization under real-world variations.
comment: 8 pages, 9 figures, accepted by the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
How Should a Robot Configure Its Laser Scanner for Inspection? IROS 2026
Robotic inspection relies on accurate sensing to acquire high-fidelity geometric measurements for defect detection and metrology. While prior work has focused on robot motion and viewpoint planning, how to configure sensing parameters remains largely underexplored, despite their decisive impact on measurement quality. We propose SenseHD, a robotic sensing system that formulates scanner configuration as an instruction-conditioned sensing decision. Instead of predicting precise parameter values, SenseHD treats sensing parameters as discrete sensing actions and selects stable sensing regimes through hyperdimensional associative memory. Experiments on a real robotic inspection platform demonstrate that SenseHD robustly selects appropriate sensing configurations and significantly improves inspection reliability, while remaining lightweight and efficient compared to baseline methods.
comment: 8 pages, 9 figures. Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
MV-WAM: Manifold-Aware World Action Model with Value Augmentation
Achieving robust and generalizable manipulation across diverse environments remains a fundamental challenge in embodied robotics. Recent world action models achieve strong in-domain performance, yet their gains do not extend proportionally to out-of-distribution scenarios. We attribute this to a structural mismatch between visual and action modalities, whose intrinsically heterogeneous manifolds cause joint optimization to disproportionately degrade action robustness under distribution shift. To address this, we propose MV-WAM, a novel end-to-end framework that jointly models visual prediction, action generation, and value estimation designed to effectively leverage video priors during both training and inference for enhanced action generalization. Key to this unification is a cross-modality causal mask that hierarchically grounds actions in predicted video frames and value function tokens in both modalities. To further narrow the generalization gap, MV-WAM adopts a manifold-aware optimization scheme that explicitly accounts for the structural heterogeneity across modalities. Finally, MV-WAM introduces a progress-value regulation mechanism that estimates task completion and detects misalignment between predicted frames and generated actions, enabling the policy to autonomously identify execution deviations and recover through value-guided rollback. On the RoboTwin simulation, MV-WAM achieves a 55.7% mean success rate on random scenarios without any randomized action supervision, outperforming the strongest baseline by 29.3%. MV-WAM achieves a 77.5% mean success rate across four real-world tasks of varying difficulty on a dual-arm robot. Our results demonstrate that manifold-aware cross-modal alignment is essential for robust policy generalization, offering a path toward deployable robotic manipulation.
comment: 20 pages, 9 figures, 7 tables
ReFPO: Reflow Regularization for Flow Matching Policy Gradients
We present Reflow-regularized Flow Matching Policy Gradients (ReFPO), a simple online RL method that adds explicit Reflow regularization to FPO for efficient flow-based control. We uncover a key structural property: the gradient updates in Flow Matching Policy Gradients (FPO) can be interpreted as an implicit advantage-weighted Reflow process, providing a new geometric perspective on flow-based policy gradients. Building on this insight, ReFPO introduces an explicit geometric regularizer that can be implemented with a single line of code change without incurring additional computational overhead or auxiliary distillation stages. By synergizing advantage-guided updates with path rectification, our method reduces CFM proxy-ratio spikes, stabilizes PPO-style training, and enables high-fidelity one-step inference that often matches or exceeds multi-step performance. We experimentally demonstrate that ReFPO improves average performance and discretization robustness across GridWorld, MuJoCo Playground, and high-dimensional Humanoid Control tasks, providing a scalable and stable approach for generative policies in complex physical simulations.
Membrane-based Acoustic Microrobots IROS 2026
Acoustic microrobots have emerged as a promising frontier for targeted drug delivery and minimally invasive medicine due to their high-power density and biocompatibility. Despite wide-ranging designs, conventional acoustic microrobots mostly rely on air microbubbles trapped within confined microcavities within the robot body, which suffer from limited operational longevity due to rapid gas dissolution and resultant shifts in resonance frequency. In this paper, we propose a robust, membrane-based acoustic microrobot that overcomes these limitations by employing a thin flexible Polydimethylsiloxane (PDMS) membrane bonded over confined microcavities for microstreaming. The introduced design physically prevents gas diffusion, ensuring stable performance over extended periods at high actuation voltages. We systematically characterized the membrane-based acoustic actuator longevity, demonstrating consistent streaming and propulsion for over 24 hours of continuous operation. In addition, by embedding magnetic microparticles into the structural body, these actuators were successfully employed as microswimmers with directional control using low-intensity (2 mT) external magnetic fields. Finally, we demonstrate the scalability of the proposed design architecture down to ~100 um. This membrane-based approach establishes a reliable framework for the development of high-endurance acoustic microactuators and microrobots capable of performing long-term tasks.
comment: Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026). For supplementary video contact fkocabas@wisc.edu
BayesFP: Posterior Estimation for Flow-Based Policies via Feynman-Kac Sampling
Robots must generate trajectories that remain faithful to learned expert behavior while satisfying safety constraints and task-specific objectives specified only at inference time. We formulate constrained trajectory generation for pretrained diffusion and flow-matching policies as Bayesian posterior sampling, with the learned demonstration distribution as a prior and an inference-time, cost-derived likelihood tilting it toward feasible, optimal trajectories. To sample from this posterior without any retraining of the base policy, we leverage the Feynman--Kac corrector framework, originally formulated for diffusion models, and extend it to deterministic flow-matching policies. The result is a unified, inference-time, retraining-free sampler for diffusion and flow policies. We validate the approach on pretrained Diffusion Policy, GR00T-N1.6, and $π_{0.5}$ checkpoints across simulated and real-world manipulation tasks, including planning around non-convex obstacles introduced at inference time, and show improvements over the base $π_{0.5}$ on zero-shot tasks.
R2HandoverSim: A Simulation Framework and Benchmark for Robot-to-Human Object Handovers IROS 2026
We present R2HandoverSim, a simulation benchmark for robot-to-human (R2H) object handovers. Although R2H handover methods have advanced rapidly, the lack of standardized evaluation protocols impedes objective comparison. Our benchmark enables reproducible evaluation by systematically comparing four baselines on their predicted shared grasp poses. We conduct a user study with 30 participants, analyze baseline performance, and show that simulation results correlate with real-world evaluation outcomes. Crucially, five complementary metrics (planning feasibility, reachability, grasp stability, grasp affordance, and safety) better reflect user-perceived handover quality than overall success rate alone. Website and code: https://robot-future.github.io/r2handoversim/.
comment: Accepted by the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
Inductive Generalization for Robotic Manipulation
Understanding the generalization capabilities of visuomotor policies is essential in the development of capable robotic agents. Generalizable models learn structures that transfer across domains. However, in practice, visuomotor policies test performance by interpolation on known distributions using unstructured domain shifts (e.g. lighting, clutter, diverse objects). We argue that to measure generalization capabilities we must instead test the inductive capacity of policies on progressively harder, out-of-distribution task variants. We call this inductive generalization, drawing directly on how axis-based evaluation has revealed inherent generalization limitations in language models (e.g. sequence length, counting) arXiv:2502.00197 . We provide a reusable and formal evaluation protocol for measuring inductive generalization in any manipulation policy, and establish baselines showing that existing paradigms fail this test; e.g. SoTA Vision-Language-Action models and find that policies that appear to generalize to prior domain shifts (distractors, etc) fail inductive generalization tests. These results expose a class of learning challenges orthogonal to those addressed by data and model scaling in robot learning, yet are imperative to solve in order to realize general purpose robots.
HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training ECCV 2026
Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.
comment: Accepted to ECCV 2026. Maciej and Jesper contributed equally
DataMIL: Selecting Data for Robot Imitation Learning with Datamodels
Recently, the robotics community has amassed ever larger and more diverse datasets to train generalist policies. However, while these policies achieve strong mean performance across a variety of tasks, they often underperform on individual, specialized tasks and require further tuning on newly acquired task-specific data. Combining task-specific data with carefully curated subsets of large prior datasets via co-training can produce better specialized policies, but selecting data naively may actually harm downstream performance. To address this, we introduce DataMIL, a data selection framework built on the datamodels paradigm that reasons about data selection in an end-to-end manner, using the policy itself to identify which data points will most improve performance. Unlike standard practices that filter data using human notions of quality (e.g., based on semantic or visual similarity), DataMIL directly optimizes data selection for task success, allowing us to select data that improves the policy while dropping data that degrade it. To avoid performing expensive rollouts in the environment during selection, we introduce a surrogate loss function on task-specific data, allowing us to use DataMIL in the real world without degrading performance. We validate our approach on 60+ simulation and real-world manipulation tasks, notably showing successful data selection from the largest open collections of robot datasets (OXE); demonstrating consistent gains in success rates over prior works. Our results underscore the importance of end-to-end, performance-aware data selection for unlocking the potential of large prior datasets in robotics. More information at https://robin-lab.cs.utexas.edu/datamodels4imitation/
HumanHalo -- Safe and Efficient 3D Navigation Among Humans via Minimally Conservative MPC
Safe and efficient robotic navigation among humans is essential for integrating robots into everyday environments. Most existing approaches focus on simplified 2D crowd navigation and fail to account for the full complexity of human body dynamics beyond root motion. We present HumanHalo, a Model Predictive Control (MPC) framework for 3D Micro Air Vehicle (MAV) navigation among humans that combines theoretical safety guarantees with data-driven models for realistic human motion forecasting. Our approach introduces a novel twist to reachability-based safety formulation that constrains only the initial control input for safety while modeling its effects over the entire planning horizon, enabling safe yet efficient navigation. We validate HumanHalo in both simulated experiments using real human trajectories and in the real-world, demonstrating its effectiveness across tasks ranging from goal-directed navigation to visual servoing for human tracking. While we apply our method to MAVs in this work, it is generic and can be adapted by other platforms. Our results show that the method ensures safety without excessive conservatism and outperforms baseline approaches in both efficiency and reliability.
Scaling Self-Play for End-to-End Driving
End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird's-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.
HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning
Despite the importance of tactile sensing for reliable manipulation, most existing Vision-Language-Action (VLA) datasets remain vision-only, and those that do incorporate tactile information typically lack the joint combination of task diversity, language conditioning, and action trajectories. Furthermore, existing teleoperation pipelines rarely provide haptic feedback to the operator, despite its established role in demonstration quality and manipulation stability. In this work, we present HapTile, a contact-grounded visuotactile manipulation dataset that advances beyond vision-only trajectory datasets by embedding physical interaction sensing at two levels: fingertip tactile feedback at the robot end-effector, and haptic-informed demonstrations at the teleoperator side. The data collection platform integrates haptic feedback directly into the teleoperation controller, enabling the operator to perceive contact interactions in real time. It is built around a standard and reproducible robotic system equipped with custom-designed fingertip tactile sensors. The dataset comprises everyday manipulation tasks spanning a broad range of contact-rich skills, including pick-and-place, folding, pressing, stacking, and other routine activities. Each task is paired with language instructions that condition the policy on the manipulation objective, together with synchronized visuotactile observations and action trajectories. In addition, we provide a benchmarking study on contact-rich policy learning using two baseline models to evaluate the effectiveness of the proposed contact-grounded dataset. The dataset and additional details are available on our website: haptile-dataset.github.io.
TriLift: Interpolation-Free Tri-Plane Lifting for Efficient 3D Perception on Embedded Systems
Dense 3D convolutions provide high accuracy for perception but are too computationally expensive for real-time robotic systems. Existing tri-plane methods rely on 2D image features with interpolation, point-wise queries, and implicit MLPs, which makes them computationally heavy and unsuitable for embedded 3D inference. As an alternative, we propose TriLift, a novel interpolation-free tri-plane lifting and volumetric fusion framework that directly projects 3D voxels into plane features and reconstructs a feature volume through broadcast and summation. This shifts nonlinearity to 2D convolutions, reducing complexity while remaining fully parallelizable. To mitigate spatial information loss inherent in projections, we incorporate a lightweight adaptive positional encoding module that helps bridge the spatial information gap, dynamically recovering fine geometric details with negligible overhead. To capture global context, we add a low-resolution volumetric branch fused with the lifted features through a lightweight integration layer, yielding a design that is both efficient and end-to-end GPU-accelerated. To validate the effectiveness of the proposed method, we conduct experiments on classification, completion, segmentation, and detection, and we map the trade-off between efficiency and accuracy across tasks. Results show that classification and completion retain or improve accuracy, while segmentation and detection show a trade-off, significantly reducing computational demand with only a slight decrease in accuracy. On-device benchmarks on an NVIDIA Jetson Orin Nano confirm robust real-time throughput, demonstrating the suitability of the approach for embedded robotic perception.
FILIC: Dual-Loop Force-Guided Imitation Learning with Impedance Torque Control for Contact-Rich Manipulation Tasks
Many contact-rich manipulation tasks require precise force regulation. However, most imitation learning (IL) policies remain position-centric and lack explicit force awareness, and adding force/torque sensors to collaborative robot arms is often costly and requires additional hardware design. To overcome these issues, we propose FILIC, a Force-guided Imitation Learning framework with impedance torque control. FILIC integrates a Transformer-based IL policy with an impedance controller in a dual-loop structure, enabling compliant force-informed, force-executed manipulation. For robots without force/torque sensors, we introduce a cost-effective end-effector force estimator using joint torque measurements through analytical Jacobian-based inversion while compensating with model-predicted torques from a digital twin. Experiments show that FILIC significantly outperforms vision-only and joint-torque-based methods, achieving safer, more compliant, and adaptable contact-rich manipulation. The source code is available at https://github.com/OpenGHz/mujoco-wrench-estimator.git.
SURE: Safe Uncertainty-Aware Robot-Environment Interaction using Trajectory Optimization
Robotic tasks involving contact interactions pose significant challenges for trajectory optimization due to discontinuous dynamics. Conventional formulations typically assume deterministic contact events, which limit robustness and adaptability in real-world settings. In this work, we propose SURE, a robust trajectory optimization framework that explicitly accounts for contact timing uncertainty. By allowing multiple trajectories to branch from possible pre-impact states and later rejoin a shared trajectory, SURE achieves both robustness and computational efficiency within a unified optimization framework. We evaluate SURE on two representative tasks with unknown impact times. In a cart-pole balancing task involving uncertain wall location, SURE achieves an average improvement of 21.6% in success rate when branch switching is enabled during control. In an egg-catching experiment using a robotic manipulator, SURE improves the success rate by 40%. These results demonstrate that SURE substantially enhances robustness compared to conventional nominal formulations.
You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
What happens when a pretrained generative robot policy is provided a constant initial noise as input, rather than repeatedly sampling it from a Gaussian? We demonstrate that the performance of a pretrained, frozen diffusion or flow matching policy can be improved with respect to a downstream reward by swapping the sampling of initial noise from the prior distribution (typically isotropic Gaussian) with a well-chosen, constant initial noise input - a golden ticket. We propose simple search methods to find golden tickets using Monte-Carlo policy evaluation that keeps the pretrained policy frozen, does not train any new networks, and is applicable to all diffusion/flow matching policies (and therefore many VLAs). Our approach to policy improvement makes no assumptions beyond being able to inject initial noise into the policy and calculate (sparse) task rewards of episode rollouts, making it deployable with no additional infrastructure or models. Our method improves the performance of policies in 46 out of 51 tasks across simulated and real-world robot manipulation benchmarks, with absolute improvements in success rate by up to 55% for some simulated tasks, and 28% within 60 search episodes for real-world tasks. Our approach naturally extends to multi-task settings, where we improve the average performance of a VLA policy across 7 tasks by 14% using a single noise vector. Further, we find that, a golden ticket optimized for one task can also boost performance in other related tasks for the same VLA policy. We release a codebase with pretrained policies and golden tickets for simulation benchmarks using VLAs, diffusion policies, and flow matching policies.
comment: 34 pages, 14 figures, 12 tables
ZeroDex: Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning
We present ZeroDex, a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D task plans from calibrated multi-view RGB images. Rather than training an end-to-end policy, our system uses a vision-language model (VLM) to produce reference-frame task grounding and primitive-level 2D keypoints, then lifts them into 3D via multi-view fusion. This lifting combines triangulation of view-wise VLM groundings with reference-view ray voting, which searches along a semantic camera ray for geometrically consistent candidates across neighboring views. The resulting 3D keypoints support both pick-and-place and tool-use: for tool-use, we retrieve an object-centric atomic action corresponding to the inferred skill category and align its stored 6D tool trajectory to the scene; for dexterous execution, we expand the lifted grasp keypoint into a task-conditioned grasp affordance region and generate feasible grasp-motion pairs with an arm-hand motion generator. Real-world experiments show improved 3D grounding accuracy and execution reliability over single-view RGB-D grounding and fine-tuned VLA baselines. We further demonstrate long-horizon manipulation through closed-loop status verification and replan, enabling zero-shot execution on unseen objects and tool-use tasks in novel scenes.
FGGS-LiDAR: Ultra-Fast, GPU-Accelerated Simulation from General 3DGS Models to LiDAR
While 3D Gaussian Splatting (3DGS) has emerged as a strong representation for photorealistic rendering, its vast ecosystem of assets remains difficult to use for high-performance LiDAR simulation, a critical tool for robotics and autonomous driving. We present \textbf{FGGS-LiDAR}, a geometry-first framework that bridges this gap in a plug-and-play manner. Our method converts pretrained 3DGS assets into watertight meshes directly from Gaussian parameters, without requiring LiDAR-specific supervision or architectural alterations, via volumetric discretization and Truncated Signed Distance Field (TSDF) extraction. We pair this with a GPU-accelerated ray-casting module that simulates LiDAR returns at over 500 FPS and supports batched multi-environment simulation with up to 4096 environments. In large-scale parallel settings, FGGS-LiDAR achieves an order-of-magnitude lower LiDAR simulation latency than Isaac Sim. We validate FGGS-LiDAR on both indoor and outdoor scenes, demonstrating high LiDAR-simulation fidelity. Furthermore, on COLMAP-posed indoor benchmarks, we compare against existing 3DGS-to-mesh baselines and report lower LiDAR-simulation error. Code is available at https://github.com/discoverse-dev/FGGS-LiDAR.
Leaderless Collective Motion in Affine Formation Control over the Complex Plane
We propose a method for the collective maneuvering of affine formations in the plane by modifying the original weights of the Laplacian matrix used to achieve static formations of robot swarms. Specifically, the resulting collective motion is characterized as a time-varying affine transformation of a reference configuration, or shape. Unlike the traditional leader-follower strategy, our leaderless scheme allows agents to maintain distinct and possibly time-varying velocities, enabling a broader range of collective motions, including all the linear combinations of translations, rotations, scaling and shearing of a reference shape. Our analysis provides the analytic solution governing the resulting collective motion, explicitly designing the eigenvectors and eigenvalues that define this motion as a function of the modified weights in the new Laplacian matrix. To facilitate a more tractable analysis and design of affine formations in 2D, we propose the use of complex numbers to represent all relevant information. Simulations with up to 20 agents validate the theoretical results.
comment: Accepted for publication in IEEE Transactions on Control of Network Systems (TCNS), 12 pages
A Geometric Task-Space Port-Hamiltonian Formulation for Redundant Manipulators
We present a novel geometric port-Hamiltonian formulation of redundant manipulators performing a differential kinematic task $η=J(q)\dot{q}$, where $q$ is a point on the configuration manifold, $η$ is a velocity-like task space variable, and $J(q)$ is a linear map representing the task, for example the classical analytic or geometric manipulator Jacobian matrix. The proposed model emerges from a change of coordinates from canonical Hamiltonian dynamics, and splits the standard Hamiltonian momentum variable into a task-space momentum variable and a null-space momentum variable. Properties of this model and relation to Lagrangian formulations present in the literature are highlighted. Finally, we apply the proposed model in an \textit{Interconnection and Damping Assignment Passivity-Based Control} (IDA-PBC) design to stabilize and shape the impedance of a 7-DOF Emika Panda robot in simulation.
Robust Tightly-Coupled Filter-Based Monocular Visual-Inertial State Estimation and Graph-Based Evaluation for Autonomous Drone Racing IROS 2026
Autonomous drone racing (ADR) demands state estimation that is simultaneously computationally efficient and resilient to the perceptual degradation experienced during extreme velocity and maneuvers. Traditional frameworks typically rely on conventional visual-inertial pipelines with loosely-coupled gate-based Perspective-n-Points (PnP) corrections that suffer from a rigid requirement for four visible features and information loss in intermediate steps. Furthermore, the absence of GNSS and Motion Capture systems in uninstrumented, competitive racing environments makes the objective evaluation of such systems remarkably difficult. To address these limitations, we propose ADR-VINS, a robust, monocular visual-inertial state estimation framework based on an Error-State Kalman Filter (ESKF) tailored for autonomous drone racing. Our approach integrates direct pixel reprojection errors from gate corners features as innovation terms within the filter. By bypassing intermediate PnP solvers, ADR-VINS maintains valid state updates with as few as two visible corners and utilizes robust reweighting instead of RANSAC-based schemes to handle outliers, enhancing computational efficiency. Furthermore, we introduce ADR-FGO, an offline Factor-Graph Optimization framework to generate high-fidelity reference trajectories that facilitate post-flight performance evaluation and analysis on uninstrumented, GNSS-denied environments. The proposed system is validated using TII-RATM dataset, where ADR-VINS achieves an average RMS translation error of 0.143 m, while ADR-FGO yields 0.060 m as a smoothing-based reference. Finally, ADR-VINS was successfully deployed in the A2RL Drone Championship Season 2, maintaining stable and robust estimation despite noisy detections during high-agility flight at top speeds of 20.9 m/s. We further utilize ADR-FGO for post-flight evaluation in uninstrumented racing environments.
comment: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
HRDexDB: A Paired Human-Robot Dataset for Cross-Embodiment Dexterous Grasping
We present HRDexDB, a paired cross-embodiment dexterous grasping dataset of high-fidelity dexterous grasping sequences featuring both human and diverse robotic hands. Unlike existing datasets, HRDexDB provides a comprehensive collection of grasping trajectories across human hands and multiple robot hand embodiments, spanning 100 diverse objects. Leveraging state-of-the-art vision methods and a dedicated multi-camera system, HRDexDB offers high-precision spatiotemporal 3D ground-truth motion for both the agent and the manipulated object. The dataset comprises 2.1K grasping trials, each enriched with synchronized visual and kinematic modalities, with contact-force signals available for tactile-enabled robotic hands. By providing closely aligned captures of human dexterity and robotic execution on the same target objects under comparable grasping motions, HRDexDB serves as a foundational benchmark for cross-embodiment dexterous manipulation.
APEX: Action Priors Enable Efficient Exploration for Robust Motion Tracking on Legged Robots
Learning natural, animal-like locomotion from demonstrations has become a core paradigm in legged robotics. While motion tracking can reproduce reference gaits, many approaches still require substantial tuning and depend on reference motion inputs at deployment, which can limit responsiveness to task objectives and reduce adaptability. We present APEX (Action Priors enable Efficient eXploration), a motion-tracking reinforcement learning (RL) framework that removes deployment-time dependence on reference motion inputs, improves sample efficiency, and reduces tuning effort. APEX integrates demonstrations into RL via decaying action priors, which guide early exploration toward demonstration-consistent actions and then fade to zero, yielding a pure RL policy at deployment. This is combined with a multi-critic framework that separates style and task + regularization learning signals. Moreover, APEX enables a single policy to learn diverse motions and transfer reference-like styles across different terrains and velocities, while remaining robust to variations in training parameters. We validate our method in simulation on both humanoid and quadruped robots, and with zero-shot deployment on a Unitree Go2 robot. Website and code: https://marmotlab.github.io/APEX/.
comment: \c{opyright} 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
CLAR: Learning 3D Representations for Robotic Manipulation by Fusing Masked Reconstruction with Multi-Level Contrastive Alignment
The spatial information inherent in 3D point clouds is crucial for robotic manipulation. However, existing 3D pre-training methods face a fundamental trade-off: Masked Autoencoding (MAE) excels at capturing spatial-geometric features but lacks semantics, whereas contrastive learning, while able to distill semantics from 2D foundation models, is ill-suited for the fine-grained details required for manipulation tasks. To address these challenges, we propose CLAR, a novel 3D pre-training framework that synergizes global understanding with fine-grained local alignment. Our framework unifies MAE with global cross-modal contrastive learning to integrate robust spatial awareness with rich semantic understanding. To enhance its focus on fine-grained details, at the local level, we introduce an adaptive alignment mechanism that leverages deformable attention to force precise correspondences between local 3D geometry and 2D visual features, thereby overcoming the limitations of conventional global alignment in manipulation tasks. Extensive experiments in simulation and the real world demonstrate that CLAR achieves state-of-the-art performance, significantly outperforming existing methods in visuomotor policy learning.
EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning
Internet videos constitute the largest reservoir of embodied human manipulation knowledge, yet converting arbitrary RGB footage into actionable robot training data remains a major bottleneck. Existing lab- or factory-collected datasets are narrow in scale and diversity, limiting open-world robot learning. Instead of proposing a static dataset, we introduce EgoInfinity, a universal 4D hand-object interaction data engine that enables web-scale data generation for robot retargeting and learning. EgoInfinity is a modular engine integrating perception, segmentation, reconstruction, interaction-aware refinement, and retargeting to automate this traditionally unscalable video-to-action problem without human-in-the-loop annotation. Its modular design lets the engine continuously benefit from advances in any incorporated component. With EgoInfinity, in-the-wild human manipulation videos are lifted into agent-agnostic, metric 4D hand-object representations, including hand trajectories, 6-DoF object poses, and contact-relevant states. Rather than naively connecting standalone components, EgoInfinity combines cross-module metric calibration with interaction-aware refinement to improve physical reliability, reducing drift and contact inconsistencies common in pure visual reconstruction. We further propose a novel motion retargeter that compiles the recovered 3D hand motions into executable joint trajectories for diverse robot morphologies, enabling video-to-action retargeting on any robot from arbitrary viewpoints and shot sizes (e.g., the human body is only partially visible). We validate EgoInfinity across perception fidelity, kinematic feasibility, contact consistency, cross-embodiment generalization, and real-robot skill acquisition (e.g., grasping, cutting, wiping, and pouring), demonstrating a scalable bridge from internet videos to executable robot behavior for open-world robot learning.
comment: 24 pages. Project page: https://huggingface.co/spaces/Rice-RobotPI-Lab/EgoInfinity
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation
Despite the success of vision-based generalist robotic policies, existing tactile-based policies remain tied to fixed embodiments and sensor setups. This is because tactile signals are highly heterogeneous across hardware, making cross-sensor generalization difficult. We present FTP-1,the first generalist foundation tactile policy pretrained to acquire transferable tactile manipulation abilities across diverse sensors and embodiments. FTP-1 supports varied tactile inputs, including image-, array-, and state-based signals, by using heterogeneous encoders to project them into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretrained on around 3,000 hours of tactile manipulation data aggregated from 26 data sources, spanning human and robot demonstrations across 21 sensors, FTP-1 learns tactile skills that transfer beyond the sensors seen during pretraining. Across downstream finetuning experiments spanning 5 hardware configurations, FTP-1 improves contact-rich manipulation on seen sensor setups by +17.2% and, surprisingly, transfers to two previously unseen tactile-sensor setups, achieving a +31% gain in success rate. FTP-1 establishes the first unified foundation baseline for tactile manipulation, providing future tactile policies with a shared model-level starting point. Pretrained models, datasets, training code and more visualization at https://ftp1-policy.github.io.
GAPartManip: A Large-scale Part-centric Dataset for Material-Agnostic Articulated Object Manipulation ICRA 2025
Effectively manipulating articulated objects in household scenarios is a crucial step toward achieving general embodied artificial intelligence. Mainstream research in 3D vision has primarily focused on manipulation through depth perception and pose detection. However, in real-world environments, these methods often face challenges due to imperfect depth perception, such as with transparent lids and reflective handles. Moreover, they generally lack the diversity in part-based interactions required for flexible and adaptable manipulation. To address these challenges, we introduced a large-scale part-centric dataset for articulated object manipulation that features both photo-realistic material randomization and detailed annotations of part-oriented, scene-level actionable interaction poses. We evaluated the effectiveness of our dataset by integrating it with several state-of-the-art methods for depth estimation and interaction pose prediction. Additionally, we proposed a novel modular framework that delivers superior and robust performance for generalizable articulated object manipulation. Our extensive experiments demonstrate that our dataset significantly improves the performance of depth perception and actionable interaction pose prediction in both simulation and real-world scenarios. More information and demos can be found at: https://pku-epic.github.io/GAPartManip/.
comment: Accepted by ICRA 2025. Project page: https://pku-epic.github.io/GAPartManip/
Behavior Cloning Under PD Control: A Finite-Horizon Theory of Gain-Dependent Error Amplification
Behavior cloning (BC) on position-controlled robots is shaped by the PD loop that executes policy actions. We give a finite-horizon, nonasymptotic analysis of how controller gains affect BC failure. Independent sub-Gaussian action errors propagate through gain-dependent closed-loop dynamics into sub-Gaussian position errors. The resulting failure tail is controlled by controller amplification multiplied by validation loss and generalization slack, so validation loss alone can mis-rank gains. Under shape-preserving upper-bound assumptions, the analysis separates label difficulty, injection strength, and contraction, ranking compliant-overdamped gains as tightest and stiff-underdamped gains as loosest, with the mixed regimes system-dependent. In the canonical scalar second-order PD system, stationary position-error variance increases with stiffness and decreases with damping over the stable range, and exact zero-order-hold discretization inherits the ordering to leading order. This extends the error-attenuation explanation of bronars et al. (2026) to finite-horizon failure bounds.
GenTrack: A New Generation of Multi-Object Tracking
This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: first-a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics, second-leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors, third-integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions, fourth-a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates, and five-the first ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Simple, Strengthen, and Super, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
comment: The content of this paper was included in the full manuscript of GenTrack family which has been submitted to the journal for possible publication
Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation
Vision-language-action (VLA) policies have shown strong potential for general-purpose manipulation, yet they often fail on novel, out-of-distribution objects whose appearance or geometry deviates from the training distribution. The standard remedy is to collect multi-view teleoperation data for every failure case, but this scales poorly in both cost and time. We introduce Pose6DAug, a failure-driven data augmentation framework that turns a policy's own successful episodes into targeted demonstrations for its failure modes, without any new data collection. Our key insight is that each successful episode already encodes a physically valid action trajectory together with calibrated multi-view observations. By swapping only the manipulated object while preserving this trajectory, we obtain new and physically grounded demonstrations. However, naive 2D video editing breaks multi-view consistency and physical plausibility, particularly under heavy occlusion and egocentric viewpoints. Our method instead operates directly in 3D, anchoring the target object with an explicit mesh driven by a temporally coherent 6D pose trajectory, ensuring geometrically consistent renderings across all camera views. Fine-tuning a VLA on data augmented by our method improves success rates by 16.5% relative to the state-of-the-art baseline on novel objects, while preserving in-distribution performance. These results show that multi-view and physically consistent augmentation is a practical path to scalable VLA generalization.
Self-Curriculum Model-based Reinforcement Learning for Shape Control of Deformable Linear Objects IROS 2026
Precise shape control of Deformable Linear Objects (DLOs) is crucial in robotic applications such as industrial and medical fields. However, existing methods face challenges in handling complex large deformation tasks, especially those involving opposite curvatures, and lack efficiency and precision. To address this, we propose a two-stage framework combining Reinforcement Learning (RL) and online visual servoing. In the large-deformation stage, a model-based reinforcement learning approach using an ensemble of dynamics models is introduced to significantly improve sample efficiency. Additionally, we design a self-curriculum goal generation mechanism that dynamically selects intermediate-difficulty goals with high diversity through imagined evaluations, thereby optimizing the policy learning process. In the small-deformation stage, a Jacobian-based visual servo controller is deployed to ensure high-precision convergence. Simulation results show that the proposed method enables efficient policy learning and significantly outperforms mainstream baselines in shape control success rate and precision. Furthermore, the framework effectively transfers the policy trained in simulation to real-world tasks with zero-shot adaptation. It successfully completes all 30 cases with diverse initial and target shapes across DLOs of different sizes and materials. The project website is available at: https://anonymous.4open.science/w/sc-mbrl-dlo-EB48/
comment: Accepted to IROS 2026
ScanBot: A Benchmark for Precision Robotic Surface Scanning with Industrial Laser Profilers IROS 2026
We introduce ScanBot, a benchmark for instruction-conditioned, high-precision surface scanning with robot-mounted industrial laser profilers. Unlike existing robot learning datasets that emphasize coarse behaviors such as grasping, navigation, or dialogue, ScanBot targets sensing-centric tasks where sub-millimeter motion continuity, strict stand-off control, and stable scanner settings are essential for acquiring usable geometry. The dataset contains scanning trajectories over twenty objects, including electronic components and structured 3D-printed parts, and spans six task types that range from broad inspection to fine-grained detail scanning and geometry-critical operations, including metrology and registration. Each episode is specified by natural language instructions and paired with synchronized first-person RGB-D, third-person video, laser height profiles, robot joint and pose traces, and scanner-parameter logs. These requirements expose a gap: despite recent progress, learning-based models often fail to produce stable and feasible scan motions under fine-grained instructions and real laser-profiling constraints. To reflect how industrial scanning is actually done, we evaluate methods through a two-stage pipeline. Stage I asks the model to "set up the sensor" by recommending scanner parameters, while Stage II asks it to "move like a scanner" by producing smooth, feasible trajectories that maintain stand-off and cover the intended region under precision demands.
comment: 9 pages, 7 figures. Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
Temporal Self-Imitation Learning
Long-horizon robot manipulation policies trained with reward shaping can still achieve high return through inefficient interactions, while rare efficient behaviors discovered during training may be forgotten. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.
ASC-SW: A Lightweight Atrous Strip Convolution Network for DLOs Segmentation on Edge mobile Robots IROS2026
Detecting deformable linear objects (DLOs), such as floor cables, is essential for safe mobile robot navigation but remains challenging due to oblique viewpoints, thin structures, and limited edge-device resources. Existing DLO segmentation methods are primarily designed for manipulator platforms with fixed top-down views and often require heavy models, limiting their deployment on mobile robots. We formulate a cross-view DLO segmentation problem, where models trained on manipulator-view data must generalize to mobile robot perspectives. To address this, we propose ASC-SW, a lightweight and geometry-aware segmentation framework. The core network, ASCNet, introduces Atrous Strip Convolution, combining directional strip filtering with dilated receptive fields to enhance sensitivity to elongated structures at low computational cost. An Atrous Strip Convolution Spatial Pyramid Pooling module enables multi-scale anisotropic feature aggregation, while a temporal Sliding Window refinement suppresses viewpoint-induced false positives. Evaluated on real-world mobile robot data, ASC-SW achieves 74.1% mIoU at 261 FPS and remains deployable on edge devices.
comment: paper update: Rejected by IROS2026 on June 17th
Build Once, Monitor Continuously: Persistent Semantic Mapping via Autonomous Exploration and Open-Vocabulary Object Updates
Persistent semantic monitoring of indoor spaces such as warehouses, hospitals, and offices requires a robot to repeatedly monitor an environment and track how objects change over time. Running full simultaneous localization and mapping (SLAM) with dense semantic reconstruction from scratch on every visit is redundant when the environment geometry stays the same and only the objects move. We present a modular two-stage system that separates geometric mapping from semantic updating. In the first stage, a frontier-based exploration method with a dynamic search window builds a 2D occupancy grid. In the second stage, the robot relocalizes in this map and builds a semantic object graph using an open-vocabulary object detector and a promptable segmentation model. Only the lightweight semantic stage is repeated on later visits, so the system scales well to frequent revisits. The object graph uses a category and distance based association rule to update objects, which lets the map reflect both intra-session changes (object changes within a single traversal) and inter-session changes (changes across revisits), such as objects being moved, removed, or added. We validate the system on a Fetch robot in two real indoor environments of about 8,500 sq.m and 117 sq.m, and report precision, recall, and F1 scores across multiple update iterations.
comment: 10 pages, 8 figures, 5 tables. Project page is available at https://irvlutd.github.io/SemanticMapping/
Intent-Handover: Grounding Language in Human-Usage Regions for Trustworthy Robot-to-Human Handovers IROS 2026
Spoken instructions in robot-to-human handovers may specify either an object ("the cup") or an intended use ("pour water"); in both cases, successful handover requires the robot to infer the target object and the region remaining available for the human to hold. If the robot grasps that hold region, the object could become awkward to receive and immediately use, potentially reducing perceived competence and trust; if the gripper approaches too close to the receiving hand during delivery, perceived safety may also suffer. We present Intent-Handover, which grounds unconstrained speech and visual scene context into explicit grasp and delivery constraints. Given a spoken instruction and a scene observation, a vision-language model identifies the target object and the intended human-usage region. A grasp optimization module then selects a feasible grasp keeping this region accessible while enforcing clearance from the predicted receiving hand. During execution, the robot tracks upper-body key points to estimate the user's receiving pose and places the handover at an ergonomically feasible location. In a within-subjects ablation study (n=30), human-usage region awareness increases perceived trust, hand-gripper collision avoidance increases perceived safety, and interaction comfort is highest when both are enabled. Website and code: https://robot-future.github.io/intent-handover/.
comment: Accepted at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
FARM: Find Anything using Relational Spatial Memory
Robots operating in homes, warehouses, and other object-rich environments need memory systems that can find specific object instances on demand. Object-level memory alone is often insufficient: scenes contain many plausibly matching objects, and users refer to the target through relations to landmarks and surrounding objects (e.g. ``the tall lamp below the dartboard and to the left of the poster''), demanding a relational spatial memory that supports retrieval through semantic, appearance, and spatial predicates over objects. To achieve this, we present FARM (Find Anything using Relational Spatial Memory), which builds, in real time at 5-10 Hz, a compact, open-vocabulary, object-level memory with geometry, visual-language descriptors, and viewpoint evidence. At query time, FARM uses VLMs to parse the query and score visual evidence, while grounding spatial constraints explicitly through object symbols and relational predicates. This structured use of VLMs enables more accurate and robust retrieval than end-to-end reasoning over frame histories or scene-graph context. In experiments on 44k language queries spanning 67 indoor and outdoor scenes, ranging from 15 to 15,000 m^2, FARM improves Recall@5 and Recall@10 over prior methods by 164% and 224%, and a final VLM reranking stage improves Accuracy@1 by 35%, while running in real time. We further demonstrate closed-loop deployment on a quadrupedal robot using onboard sensors and compute.
Multiagent Systems
Cohort Organized Learning: Clustering Through Agreement
In this article we describe Cohort Organized Learning (CoOL), a method for clustering data without explicit distance or similarity computations. Herein, we will describe CoOL, derive the gradients determined by expectation maximization to train the networks, show how to monitor convergence during training and evaluate the clusters after training, and discuss a series of examples and use cases. We also discuss CoOL's limitations and future prospects on related tasks. Because CoOL uses neural networks to estimate the clusters, it can be used to cluster any data that can be made compatible and we illustrate this on vector data and images.
comment: 20 pages with 14 figures and 4 tables
Hallucination as Context Drift: Synchronization Protocols for Multi-Agent LLM Systems
Multi-agent LLM systems routinely produce hallucinated outputs that cannot be explained by model deficiencies alone. A significant class of these failures arises not from model incapacity but from context drift: the divergence of internal knowledge states between concurrent agents. When agents enter a collaborative task with mismatched or stale representations of shared world state, their joint reasoning produces contradictions that manifest as hallucination. We define the Context Divergence Score (CDS), a lightweight scalar metric quantifying knowledge-state discrepancy between agent pairs across spatial, temporal, and task dimensions, and propose the Shared State Verification Protocol (SSVP), which lets agents periodically exchange compressed state summaries and flag high-divergence conditions before joint reasoning. We evaluate SSVP across two domains (multi-agent travel and software project planning) using Claude Haiku. In controlled experiments (n=30 per condition, travel; n=10, software) across 8 scenarios, naive full-broadcast synchronization increases hallucination rate by 34% above the no-sync baseline (HR: 0.658 vs. 0.492, p=0.0022, d=1.18), a contamination effect from propagating erroneous agent states. SSVP avoids this failure mode while showing modest, consistent reduction (HR: 0.463, d=0.30) and achieves significantly lower hallucination than full-broadcast (p=0.0005, d=1.47) using 58% fewer API calls. The contamination effect does not replicate in the software domain, where all conditions converge to low HR (<0.2), confirming it is specific to tasks where one erroneous shared belief cascades across evaluation dimensions. Our results reframe hallucination mitigation as a distributed systems problem and establish context synchronization as a first-class primitive in multi-agent LLM design.
comment: 11 pages, 1 figure
Monitoring Diameters of Causal Communication Graph with Spatio-Temporal Logic
Verification of multi-agent systems requires the ability to check meticulous topological properties when it comes to agents that can move through space in continuous time. This demands a logic with sufficient expressiveness to capture these dynamics. MuTGL logic has interesting properties for expressing entangled space-time properties. However, this logic lacks the expressivity needed to analyse reachability within specific distance bounds, or to track the length or the cost of communication chains: these are fundamental for decentralized monitoring, or graph-theoretic analysis of distributed protocols, where algorithmic complexities often relates with the system's communication graph diameter. We then introduce an extension of muTGL, including a new operator called the space horizon. This addition allows us to bound the distance of communication chains, hence enhancing the logic's expressiveness. We show that this operator allows to encode modalities from other logics, such as reachability or escaping which were not available in vanilla muTGL, while allowing a deeper entanglement of spatial and temporal properties. We provide a centralized offline monitoring algorithm for this logic and illustrate it on several examples on simulations of Consensus-Based Bundle Algorithms, distributed protocols for task allocation.
Simultaneously Efficient Allocation of Indivisible Items Across Multiple Dimensions
Many allocation problems are intrinsically multidimensional, since an item may contribute differently to several criteria, and optimizing a single aggregate objective can hide severe losses in other dimensions. We study how much efficiency can be guaranteed simultaneously when indivisible items have multiple attributes. To this end, we introduce the \emph{multidimensional efficient allocation} (MDEA) model, where each agent has an additive valuation in each dimension, and investigate simultaneous efficiency under utilitarian social welfare (USW) and egalitarian social welfare (ESW). Our results reveal a sharp worst-case frontier. For exact efficiency, maximizing the number of dimensions attaining the USW optimum admits a $c/\ell$-approximation for every fixed constant $c$, and this dependence on the number $\ell$ of dimensions is essentially unavoidable; for ESW, even deciding whether two dimensions can be optimized simultaneously is NP-hard with binary valuations. For approximate simultaneous efficiency in every dimension, we identify a tight threshold of order $1/\ell$, showing that such guarantees always exist for both USW and ESW, while any asymptotically better dependence on $\ell$ is impossible, even for binary valuations. Finally, we introduce three natural multidimensional Pareto notions and characterize both their relationships and their computational complexity.
Negative Knowledge as Failure-aware Shared Memory for AutoResearch
AI-assisted research systems generate many failed attempts, but those failures rarely become a durable, shared knowledge asset. We propose a negative knowledge memory layer: a curator agent converts each failed attempt into a bounded, typed record in a shared bank, and a downstream research agent explicitly adopts or rejects those records before proposing its next experiment. We evaluate this layer in two settings: same-task retry on ScienceAgentBench and cross-task scientific research on two nonlinear math-physics PDE problems. The negative knowledge layer outperforms vanilla AutoResearch baselines while using fewer tokens; agents with the negative knowledge bank solve new tasks that all baselines fail to solve in PDE systems research. We also show that the previous negative knowledge bank can transfer and enhance AutoResearch on different PDE problems. These results suggest that structured negative knowledge is a knowledge asset that should be explicitly maintained in broader AI-engaged scientific research beyond a memory-compression or debugging aid, alongside positive findings, as a collective infrastructure for scientific memory. Code is available at https://github.com/hch-wang/Negative_Knowledge.
comment: 13 pages, 4 figures
Towards Adaptive Categories: Dimensional Governance for Agentic AI
As AI systems evolve from static tools to dynamic agents, traditional categorical governance frameworks -- based on fixed risk tiers, levels of autonomy, or human oversight models -- are increasingly insufficient on their own. Systems built on foundation models, self-supervised learning, and multi-agent architectures increasingly blur the boundaries that categories were designed to police. In this article, we make the case for dimensional governance: a framework that tracks how decision authority, process autonomy, and accountability (the 3As) distribute dynamically across human-AI relationships. A critical advantage of this approach is its ability to explicitly monitor system movement toward and across key governance thresholds, enabling pre-emptive adjustments before risks materialise. This dimensional approach provides the necessary foundation for more adaptive categorisation, enabling thresholds and classifications that can evolve with emerging capabilities. While categories remain essential for decision-making, building them upon dimensional foundations allows for context-specific adaptability and stakeholder-responsive governance that static approaches cannot achieve. We outline key dimensions, critical trust thresholds, and practical examples illustrating where rigid categorical frameworks fail -- and where a dimensional mindset could offer a more resilient and future-proof path forward for both governance and innovation at the frontier of artificial intelligence.
comment: 12 pages core text, 15 pages including references, 2 figures
Seeding with Differentially Private Network Information AAMAS 2023
In public health interventions such as distributing preexposure prophylaxis (PrEP) for HIV prevention, decision makers often use seeding algorithms to identify key individuals who can amplify intervention impact. However, building a complete sexual activity network is typically infeasible due to privacy concerns. Instead, contact tracing can provide influence samples, observed sequences of sexual contacts, without full network reconstruction. This raises two challenges: protecting individual privacy in these samples and adapting seeding algorithms to incomplete data. We study differential privacy guarantees for influence maximization when the input consists of randomly collected cascades. Building on recent advances in costly seeding, we propose privacy-preserving algorithms that introduce randomization in data or outputs and bound the privacy loss of each node. Theoretical analysis and simulations on synthetic and real-world sexual contact data show that performance degrades gracefully as privacy budgets tighten, with central privacy regimes achieving better trade-offs than local ones.
comment: Preliminary version in AAMAS 2023: https://dl.acm.org/doi/10.5555/3545946.3599081 -- Code and data: https://github.com/aminrahimian/dp-inf-max
Ahoy: LLMs Enacting Multiagent Interaction Protocols
An interaction protocol formalizes how the agents in a multiagent system interact, which facilitates implementing agents. Existing approaches yield agent implementations specific to the selected protocols. How can we engineer intelligent agents that can enact protocols but are programming-free? Our contribution, Ahoy, addresses this question by creating LLM agents that dynamically select and enact declarative protocols to achieve user goals. We demonstrate that an Ahoy agent can correctly and intelligently enact multiple protocols - concurrently if appropriate to the user goal - without specialized training. Ahoy's significance lies in that it brings together declarative protocols and LLMs, both approaches that promise improved knowledge engineering for agents.
comment: Presented at EMAS 2026
Systems and Control (EESS)
Nonholonomic directional pursuit and evasion: Global feedbacks
In a recent paper by the second coauthor, directional pursuit-evasion for strictly forward-moving nonholonomic vehicles was solved "half-globally", namely under favorable initial line-of-sight conditions. In this paper, we develop feedback designs that achieve the directional pursuit and evasion objectives from arbitrary initial relative configurations. We achieve globality with completely different approaches to both the design and the analysis. Our designs are less aggressive in both the forward-speed and steering laws, allowing transient overshoot of the pursuer-evader range during global reorientation. The only price that we pay for globality is that our feedback laws require a priori knowledge of the opponent's maximal turning rate, whereas in the half-global work, no known bound of the opponent's turning rate was assumed. For the pursuit problem, the feedback law guarantees finite-time capture with prescribed directional alignment. For the evasion problem, the feedback law guarantees capture avoidance with a prescribed safety margin and achieves spinaway under a decay condition on the pursuer's turning rate. We illustrate the global capture and spinaway with simulations. The analysis is based on an integral input-to-state stability type mechanism induced by an endogenous time dilation, together with finite-time coextinction and safety-margin persistence lemmas for singularly coupled scalar inequalities.
Adaptive 5G Resource Allocation for Multistatic ISAC-Based UAV Detection and Tracking SC 2026
Unmanned aerial vehicles (UAVs) enable numerous commercial and public-safety applications, yet they also create security risks near critical infrastructure, transportation hubs, and restricted airspace. While integrated sensing and communications (ISAC) can leverage existing wireless networks for UAV surveillance, practical deployment must address competition between sensing and communication demands, as well as the challenges associated with tracking highly maneuverable UAVs with low radar cross section (RCS). This paper investigates adaptive multistatic ISAC for load-aware UAV detection and tracking in 5G wireless networks. A shared-resource framework is developed to quantify how sensing waveform length, sensing transmission rate, and beam allocation affect communication throughput in a 5G new radio (NR) system. Detection performance is analyzed using Zadoff-Chu (ZC) sensing waveforms, while tracking continuity is evaluated through an M-of-N detection model. To improve robustness under congestion, software-defined sensor (SDS) nodes exploit external signals of opportunity (SoO) to provide supplemental passive sensing opportunities when network resources become limited. Results show that adaptive sensing policies outperform fixed sensing reservations by preserving throughput under dynamic load while maintaining useful sensing capability. Under heavy congestion, SDS assistance substantially reduces tracking outage in the simulated scenarios. Cramer-Rao lower bound (CRLB) analysis demonstrates that multistatic sensing geometries improve localization accuracy and provide more uniform spatial coverage than monostatic sensing alone. These results highlight coordinated adaptive sensing and distributed multistatic support as a practical path toward resilient UAV surveillance in future wireless networks.
comment: Accepted for publication in the Proceedings of the IEEE/AIAA 45th Digital Avionics Systems Conference (DASC 2026). 10 pages, 9 figures
Stability Enhancement of Centralized UPS Data Center Systems Under Weak-Grid Conditions
Data center power systems are increasingly exposed to weak-grid conditions due to the evolution of modern power systems and the integration of large and dynamic loads. In centralized uninterruptible power supply (UPS) architectures, the front-end rectifier plays a critical role in maintaining stable operation and ensuring reliable power delivery to information technology (IT) equipment. However, conventional phase-locked loop (PLL)-based proportional-integral (PI) control strategies may exhibit degraded performance or instability under low short-circuit ratio (SCR) conditions. This paper investigates the behavior of centralized UPS systems under weak-grid conditions and demonstrates, through electromagnetic transient simulations, that PI-controlled rectifiers can become unstable at SCR=2. To address this issue, a power-based control approach is applied to the three-phase rectifier, enabling direct regulation of active and reactive power without relying on PLL synchronization. Simulation results show that the proposed control strategy improves system damping and restores stable operation under weak-grid conditions. The findings highlight the importance of control design for maintaining reliable operation of data center power systems in emerging low-strength grid environments.
Reference-Free, Long-Horizon Trajectory Optimization for Aggressive Autonomous Driving in Milliseconds
Autonomous vehicles must generate long-horizon and dynamically feasible trajectories in real time-even when operating at the limits of vehicle handling-to ensure safe operation in adverse conditions. However, existing work rarely quantifies the computational demands of generating such trajectories without prior references, warm starts and often defaults to low-fidelity models, compromising accuracy and control authority. We investigate the modeling and solver design choices that enable real-time solution of long-horizon, reference-free optimal control problems (OCPs) using full vehicle dynamics. To this end, we analyze vehicle stiffness properties to justify the OCP's integration scheme and show that lower-order A-stable methods consistently outperform alternatives, with solve time differences reaching two orders of magnitude. We show that robust nonlinear solver performance hinges on understanding barrier parameter update strategies and safeguarding techniques for Hessian indefiniteness, inherent in some interior point methods. Lastly, we propose a computationally efficient method for generating initial guesses using dynamic equilibrium, unlocking real-time performance and reducing initial infeasibility by up to four orders of magnitude. Extensive benchmarking and high-fidelity BeamNG simulation demonstrate compute times as low as 55 ms over a 260 m horizon, including high-speed obstacle avoidance scenarios where drifting emerges as a necessary component of feasible trajectory generation.
Revisiting the generalized first-order reset element with shaping filters
Reset control provides a nonlinear approach for improving closed-loop performance beyond the limitations of linear time-invariant controllers. However, the reset action inevitably introduces higher-order harmonics, which may degrade tracking performance, distort the reset signal, and reduce the reliability of frequency-domain predictions obtained via describing-function analysis. This paper revisits the generalized first-order reset element with shaping filters and develops a systematic framework for suppressing undesired reset-induced nonlinearities. Analytical conditions are derived for shaping filter coefficients to increase the low-frequency attenuation slope of the magnitude of the higher-order sinusoidal input describing functions (HOSIDFs). By modifying the asymptotic attenuation behavior of these higher-order harmonics, the proposed design provides stronger harmonic suppression in frequency regions where reset action is undesired, while preserving the beneficial first-order harmonic phase advantage near the desired cross-over frequency. The reduction in nonlinear behavior is verified through HOSIDF analysis and a superposition-law test, demonstrating that higher-order shaping filters make the reset element behave more closely to a linear system at a certain range of frequencies. Experimental validation on an industrial motion stage demonstrates improved tracking performance, reduced higher-order harmonic content, and selective activation of the reset action in the intended frequency region.
Reliability Assessment and Performance Enhancement of Reset Control Systems
This paper develops a frequency-domain reliability assessment framework for reset control systems. The closed-loop higher-order sinusoidal-input describing function formulation is extended to explicitly include the reset-triggering signal generated through a shaping filter. Based on this signal, two metrics are introduced: \(σ_t\), which quantifies reset-time deviation, and \(σ_d\), which evaluates the tendency toward additional zero crossings. These metrics provide design-oriented indicators for identifying potentially unreliable reset behavior. To improve reset-triggering reliability, a first-order shaping filter is proposed for a generalized first-order reset element, increasing the low-frequency attenuation slope of the nonzero higher-order harmonics. The proposed analysis is evaluated on an industrial motion stage. The results show that the proposed metrics capture reliability issues that are not evident from the first-order closed-loop response alone and can therefore support the design of reset controllers with more reliable reset-triggering behavior.
Local Conformity-Based Evolutionary Game Modeling of UAV Swarm Under Byzantine Attack
Leveraging their flexible and efficient deployment capabilities, unmanned aerial vehicle (UAV) swarms have been widely applied in various mission scenarios. However, the open communication environment also exposes them to the threat of Byzantine attacks. Most existing studies assume independent decision-making by each UAV, neglecting that local conformity amplifies false information propagation. This paper constructs an evolutionary game model for UAV swarm under malicious attacks based on graph evolutionary game theory, revealing how local conformity rules govern the spread of deceptive strategies. Using death-birth updating rules, we derive the macroscopic dynamic equation for the fraction of deceptive strategies and the analytical solutions to its evolutionary stable states. Sim ulations reveal observation errors weaken malicious induction, while higher proportions of malicious nodes and greater attack intensity drastically amplify attack impacts. Moreover, the model exhibits strong topological robustness across regular, scale-free and random networks.
comment: 6 pages, 5 figures
Discrete Geometric Modeling and Extended State Estimation of Continuum Robots
In this paper, we present a fully discrete approach for the accurate and numerically efficient dynamical modeling and state estimation of continuum robots. The model is based on geometrically exact beams in a minimal, strain-based formulation and derived in the framework of Lie group variational integrators, allowing to preserve important geometric properties that we exploit to achieve high accuracy and numerical efficiency. We then propose a disturbance observer based on an extended Kalman filter formulation that reliably estimates system states as well as model uncertainties and external disturbances. Experiments on a real system validate the accuracy and efficiency of the proposed model and observer.
comment: ©2026 Maximilian Herrmann, Leander Pfeiffer, Paul Kotyczka. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-ND
Virginia Tech Transportation Safety Index (VTTSI)
The Virginia Tech Transportation Safety Index (VTTSI) is a real-time, cloud-native framework for quantifying intersection safety using multimodal connected-vehicle telemetry and multi-year VDOT crash history. Traditional crash-based methods rely on lagged, aggregated data and cannot reflect rapidly changing operational conditions. VTTSI addresses this gap through a hybrid modeling approach that fuses Empirical Bayes (EB) crash stabilization, uplift factors derived from speed and conflict behavior, and a CRITIC-weighted multi-criteria decision-making (MCDM) module combining SAW, EDAS, and CODAS. The system produces interpretable, exposure-adjusted safety scores on a 0--100 scale every 15 minutes. A cloud-deployed architecture built on FastAPI, PostgreSQL, PostGIS, and Streamlit supports interactive visualization of traffic volumes, VRU exposure, speed variance, and real-time incident activity. Validation across intersections demonstrates coherent diurnal patterns, consistency among MCDM methods, and sensitivity to observable operational turbulence. Sensitivity analysis further shows that the RT--SI is robust to parameter perturbations, with deviations typically remaining below one point on the 0--100 scale. By integrating long-term crash risk with short-term behavioral dynamics, VTTSI provides a transparent, adaptive, and proactive safety-monitoring framework suitable for transportation agencies, traffic management centers, fleet operators, and autonomous vehicle systems.%~\cite{persaud2007, montella2020systemic, schultz2025_surrogate, Amraji2025CombinedSafetyIndex}.
AI Data Centers and Power System Sustainability: Understanding the Sustainability Implications of AI-Driven Data Centers on Power Systems
The rapid expansion of artificial intelligence (AI) has driven unprecedented growth in data center electricity demand. The scale and pace of this load growth carry significant implications for the sustainability of electric power systems. On the one hand, rapid, spatially concentrated data center load growth is outpacing clean energy deployment in several major regions, raising emissions and challenging both grid flexibility and reliability. On the other hand, this fast-developing and capital-intensive sector offers abundant opportunities to advance sustainability through clean energy integration and operational innovations. This article provides an overview of the mechanisms through which data center affect power system sustainability, underscoring both risks and the potential. Specifically, this article (i) characterizes AI data center load behavior and categorizes electricity supply configurations by function and sustainability profile, as well as situates these loads within global and regional electricity demand trends; (ii) analyzes sustainability impacts across short-run operational and long-run planning mechanisms, evaluates effects on grid carbon emissions and renewable energy utilization, and feasibility of offering system flexibility and participating in ancillary service; and (iii) evaluates real-world corporate sustainability pathways and highlighting both the system benefits and feasibility limits of current carbon accounting practices. The goal of this work is to synthesize existing knowledge and technological developments and to guide research and development toward a more sustainable integration of AI data centers and electric power systems.
comment: 13 pages, 3 figures, to be published in IEEE Energy Sustainability Magazine
Towards Fewer Control Laws via Continuous-Time Multiparametric Programming
Multiparametric programming offers a powerful solution to the computational burden of solving optimal control problems repeatedly online. By solving the problem once offline, it yields the optimal control laws as explicit closed-form functions of the initial system state, reducing online execution to a direct evaluation with no iterations required. Most existing works build this idea on a discrete-time foundation, slicing the time horizon into intervals and applying KKT conditions to the resulting algebraic system. This discretization forces a tradeoff: too few intervals and the model fails to capture the true system dynamics, while too many cause the problem size, the number of decision variables, and the number of critical regions to grow rapidly, making both offline preparation and online lookup increasingly expensive. This work develops a multiparametric framework that works directly with the continuous-time problem. Pontryagin's Maximum Principle (PMP) is applied without any model discretization, and the optimal control is recovered as an explicit function of the initial state. Compared to the discrete-time formulation, the continuous-time approach produces substantially fewer critical regions, and this number remains fixed regardless of accuracy requirements, since it reflects the structure of the problem itself rather than a discretization grid. The framework also yields the switching times as explicit functions of the initial state, directly exposing when and how the optimal control structure changes over the horizon. Knowing these switching times in advance allows the real-time controller to skip unnecessary computations between them, further reducing the online execution cost. Results from a PAROC framework case study demonstrate that the continuous-time multiparametric approach is a rigorous alternative to the conventional discrete-time formulations.
Optimal Control of an SIR Model with Noncompliance as a Social Contagion
We propose and study a compartmental model for epidemiology with human behavioral effects. Specifically, our model incorporates governmental prevention measures aimed at lowering the disease infection rate, but we split the population into those who comply with the measures and those who do not comply and therefore do not receive the reduction in infectivity. We then allow the attitude of noncompliance to spread as a social contagion parallel to the disease. We derive the reproductive ratio for our model and provide stability analysis for the disease-free equilibria. We then propose an optimal control scenario wherein a policy-maker with access to control variables representing disease prevention mandates, treatment efforts, and educational campaigns aimed at encouraging compliance minimizes a cost functional incorporating several cost concerns. Via careful analysis of the control-to-state map, we are able to prove existence of optimal controls. Our proof applies to dynamics which can be nonlinear in the control variables and general cost functionals including the case of $L^1$ control costs. We numerically resolve optimal strategies using the sequential quadratic Hamiltonian method, a relatively new numerical method for optimal control which is easy to implement and has good convergence theory, as we demonstrate. We test our model in several parameter regimes with specific interest in observing how the policy-maker's optimal strategies depend on their particular preferences which are expressed via design of different cost functionals.
comment: 27 pages, 8 figures
Adaptive Covariance Kalman Filtering and Nonlinear Decoupling Control via Feedback Linearization for a Three-Tank Process
Hydraulic three-tank systems are widely used in water treatment and liquid storage applications, where accurate level regulation is essential for safe and efficient operation. This paper investigates linear and nonlinear control strategies for reference tracking in a three-tank process. A linear state-feedback controller with integral action is first designed based on a linearized model, followed by a nonlinear decoupling controller using feedback linearization. In addition, an adaptive covariance Kalman filter (AKF) is employed for state estimation by dynamically updating the process-noise covariance matrix. Numerical simulations demonstrate that both control approaches achieve satisfactory reference tracking, while the proposed AKF provides accurate state estimation and effectively captures the nonlinear system behavior. The results highlight the effectiveness of combining nonlinear control and adaptive state estimation for hydraulic process systems.
comment: This paper was published in International Journal of Mechanical & Mechatronics Engineering, vol. 21, no. 03, pp. 41-48, June 2021
Proximal Gradient Dynamics and Feedback Control for Equality-Constrained Composite Optimization
This paper studies equality-constrained composite minimization problems. This class of problems, capturing regularization terms and inequality constraints, naturally arises in a wide range of engineering and machine learning applications. To tackle these optimization problems, inspired by recent results, we introduce the \emph{proportional--integral proximal gradient dynamics} (PI--PGD): a closed-loop system where the Lagrange multipliers are control inputs and states are the problem decision variables. First, we establish the equivalence between the stationary points of the minimization problem and the equilibria of the PI--PGD. Then for the case of affine constraints, by leveraging tools from contraction theory we give a comprehensive convergence analysis for the dynamics, showing convergence to a stationary point. Moreover, under suitable assumptions, we show linear--exponential convergence towards the equilibrium. That is, the distance between each solution and the equilibrium is upper bounded by a function that first decreases linearly and then exponentially. Our findings are illustrated numerically on a set of representative examples, which include an exploratory application to nonlinear equality constraints.
comment: 19 pages, 13 figures
Frequency Response Identification of Low-Order Systems: Finite-Sample Analysis
This paper proposes a frequency-domain estimator for low-order systems from repeated noisy measurements. The estimator minimizes a quadratic data-fitting term regularized by the nuclear norm of a Loewner matrix, subject to a convex stability constraint enforced via a semidefinite program. We prove a finite-sample error bound at the sampled frequencies and extend it to all frequencies through rational interpolation. The bound characterizes the dependence on the number of repeated experiments, number of frequency points, system order, and noise level. Numerical experiments on SISO and MIMO systems demonstrate the low-order-promoting effect of the method and validate the predicted scaling laws.
comment: 16 pages, Submitted to IEEE Transactions on Automatic Control
Passivity-exploiting stabilization of semilinear single-track vehicle models with distributed tire friction dynamics
This paper addresses the local stabilization problem for semilinear single-track vehicle models with distributed tire friction dynamics, represented as interconnections of ordinary differential equations (ODEs) and hyperbolic partial differential equations (PDEs). A passivity-exploiting backstepping design is presented, which leverages the strict dissipativity properties of the PDE subsystem to achieve exponential stabilization of the considered ODE-PDE interconnection around a prescribed equilibrium. Sufficient conditions for local well-posedness and exponential convergence are derived by constructing a Lyapunov functional combining the lumped and distributed states. Both state-feedback and output-feedback controllers are synthesized, the latter relying on a cascaded observer. The theoretical results are corroborated with numerical simulations, considering non-ideal scenarios and accounting for external disturbances and uncertainties. Simulation results confirm that the proposed control strategy can effectively and robustly stabilize oversteer vehicles at high speeds, demonstrating the relevance of the approach for improving the safety and performance in automotive applications.
comment: 16 pages, 11 figures. Accepted at Automatica
An Information-Based Micro-Kalman Filter for Satellite Tracking: A Comparative Study
Satellite dynamics and tracking remain important challenges in the context of space exploration and communication systems. Accurate state estimation is essential to maintain reliable orbital motion and system performance. This paper presents a mathematical framework for satellite state estimation based on a linearized model described by radial and angular states. The model incorporates two types of measurement noise corresponding to range and scaled angular deviations, which are assumed to be mutually independent with known covariance structures. The estimation problem is formulated using the Kalman filter, together with the associated Algebraic Riccati Equation (ARE), leading to both time-varying and steady-state solutions. In addition, a micro-Kalman filter ($μ$KF) formulation is considered and compared with the classical Kalman filter, as well as with the extended Kalman filter (EKF), unscented Kalman filter (UKF), and an adaptive Kalman filter under a unified simulation setup. The results demonstrate that the proposed $μ$KF achieves estimation performance nearly identical to that of the classical Kalman filter and its variants, with small and bounded estimation errors. The mean square estimation error (MSEE) remains low for all state variables under both noise configurations, confirming the effectiveness of the proposed approach for linear Gaussian systems.
comment: This version extends the previous version by including additional simulations, a comparative study with EKF, UKF, and adaptive Kalman filters, and enhanced trajectory visualization
Ultrafast On-Chip Online Learning via Spline Locality in Kolmogorov-Arnold Networks ICML'26
Ultrafast online learning is essential for high-frequency systems, such as controls for quantum computing and nuclear fusion, where adaptation must occur on sub-microsecond timescales. Meeting these requirements demands low-latency, fixed-precision computation under strict memory constraints, a regime in which conventional Multi-Layer Perceptrons (MLPs) are both inefficient and numerically unstable. We identify key properties of Kolmogorov-Arnold Networks (KANs) that align with these constraints. Specifically, we show that: (i) KAN updates exploiting B-spline locality are sparse, enabling superior on-chip resource scaling, and (ii) KANs are inherently robust to fixed-point quantization. By implementing fixed-point online training on Field-Programmable Gate Arrays (FPGAs), a representative platform for on-chip computation, we demonstrate that KAN-based online learners are significantly more efficient and expressive than MLPs across a range of low-latency and resource-constrained tasks. To our knowledge, this work is the first to demonstrate model-free online learning at sub-microsecond latencies.
comment: Forty-Third International Conference on Machine Learning (ICML'26)
Decision-Focused Continual Learning for Seaport Power-Logistics Scheduling: Generalization across Varying Tasks
Power-logistics scheduling in modern seaports typically follows a predict-then-optimize pipeline. To enhance the decision quality of predictions, decision-focused learning has been proposed, which aligns the training of forecasting models with downstream decision outcomes. However, this end-to-end design inherently restricts the value of forecasting models to a specific task structure and therefore generalizes poorly to evolving tasks induced by varying vessel arrivals. We address this gap with a decision-focused continual learning framework that adapts online to a stream of scheduling tasks. Specifically, we introduce Fisher-information-based regularization to enhance cross-task generalization by preserving parameters critical to prior tasks. A differentiable convex surrogate is also developed to stabilize gradient backpropagation. The proposed approach enables learning a decision-aligned forecasting model across a varying task stream with sustainable long-term computational and memory requirements. Experiments calibrated to Jurong Port show improved decision performance and cross-task generalization over existing methods, together with reduced computational cost and a bounded memory footprint.
comment: Preprint to IEEE Transactions on Smart Grid
Leaderless Collective Motion in Affine Formation Control over the Complex Plane
We propose a method for the collective maneuvering of affine formations in the plane by modifying the original weights of the Laplacian matrix used to achieve static formations of robot swarms. Specifically, the resulting collective motion is characterized as a time-varying affine transformation of a reference configuration, or shape. Unlike the traditional leader-follower strategy, our leaderless scheme allows agents to maintain distinct and possibly time-varying velocities, enabling a broader range of collective motions, including all the linear combinations of translations, rotations, scaling and shearing of a reference shape. Our analysis provides the analytic solution governing the resulting collective motion, explicitly designing the eigenvectors and eigenvalues that define this motion as a function of the modified weights in the new Laplacian matrix. To facilitate a more tractable analysis and design of affine formations in 2D, we propose the use of complex numbers to represent all relevant information. Simulations with up to 20 agents validate the theoretical results.
comment: Accepted for publication in IEEE Transactions on Control of Network Systems (TCNS), 12 pages
A Geometric Task-Space Port-Hamiltonian Formulation for Redundant Manipulators
We present a novel geometric port-Hamiltonian formulation of redundant manipulators performing a differential kinematic task $η=J(q)\dot{q}$, where $q$ is a point on the configuration manifold, $η$ is a velocity-like task space variable, and $J(q)$ is a linear map representing the task, for example the classical analytic or geometric manipulator Jacobian matrix. The proposed model emerges from a change of coordinates from canonical Hamiltonian dynamics, and splits the standard Hamiltonian momentum variable into a task-space momentum variable and a null-space momentum variable. Properties of this model and relation to Lagrangian formulations present in the literature are highlighted. Finally, we apply the proposed model in an \textit{Interconnection and Damping Assignment Passivity-Based Control} (IDA-PBC) design to stabilize and shape the impedance of a 7-DOF Emika Panda robot in simulation.
An Adaptive Online Smoother with Closed-Form Solutions and Information-Theoretic Lag Selection for Conditional Gaussian Nonlinear Systems
Data assimilation (DA) combines partial observations with dynamical models to improve state estimation. Filter-based DA uses only past and present data and is the prerequisite for real-time forecasts. Smoother-based DA exploits both past and future observations. It aims to fill in missing data, provide more accurate estimations, and develop high-quality datasets. However, the standard smoothing procedure requires using all historical state estimations, which is storage-demanding, especially for high-dimensional systems. This paper develops an adaptive-lag online smoother for a large class of complex dynamical systems with strong nonlinear and non-Gaussian features, which has important applications to many real-world problems. The adaptive lag allows the utilization of observations only within a nearby window, thus reducing computational complexity and storage needs. Online lag adjustment is essential for tackling turbulent systems, where temporal autocorrelation varies significantly over time due to intermittency, extreme events, and nonlinearity. Based on the uncertainty reduction in the estimated state, an information criterion is developed to systematically determine the adaptive lag. Notably, the mathematical structure of these systems facilitates the use of closed analytic formulae to calculate the online smoother and adaptive lag, avoiding empirical tunings as in ensemble-based DA methods. The adaptive online smoother is applied to studying three important scientific problems. First, it helps detect online causal relationships between state variables. Second, the advantage of reduced computational storage expenditure is illustrated via Lagrangian DA, a high-dimensional nonlinear problem. Finally, the adaptive smoother advances online parameter estimation with partial observations, emphasizing the role of the observed extreme events in accelerating convergence.
comment: Final revision. 46 pages (Main Text pp. 1--28; Appendix pp. 29--40), 9 figures (7 in Main Text, 2 in Appendix). Published in Journal of Nonlinear Science (Springer Nature). Code available upon request. For further details visit https://mariosandreou.short.gy/OnlineSmootherCGNS
Minimal Set of Questions for Theories of Consciousness: Toward a Unified Explanatory Framework
A central challenge in consciousness research is the lack of agreement on what a theory of consciousness should explain, which makes it difficult to compare existing theories. We propose a framework for organizing explanatory targets of theories based on a minimal set of seven questions designed to be theoretically neutral, causally and functionally relevant, and applicable across different systems. We focus particularly on the role of causation based on the argument that causal relations cannot be fully specified within standard physical descriptions alone. Introducing an asymmetric causal structure allows internal mechanisms to be represented explicitly and helps distinguish between variable- and structure-level causation. As an example, we apply the proposed framework to analyzing the Dual-Laws Model. The aim of the framework is not to propose a definitive theory but to provide a common basis for analyzing and developing theories of consciousness.
Harvest and Jam: Optimal Self-Sustainable Jamming Attacks against Remote State Estimation
This paper considers the optimal power allocation of a jamming attacker against remote state estimation. The attacker is self-sustainable and can harvest energy from the environment to launch attacks. The objective is to carefully allocate its attack power to maximize the estimation error at the fusion center. Regarding the attacker's knowledge of the system, two cases are discussed: (i) perfect channel knowledge and (ii) unknown channel model. For both cases, we formulate the problem as a Markov decision process (MDP) and prove the existence of an optimal deterministic and stationary policy. Moreover, for both cases, we develop algorithms to compute the allocation policy and demonstrate that the proposed algorithms for both cases converge to the optimal policy as time goes to infinity. Additionally, the optimal policy exhibits certain structural properties that can be leveraged to accelerate both algorithms. Numerical examples are given to illustrate the main results.
Sandbox-Enabled Digital Twin for Cyber-Physical Systems
Firmware/software in cyber-physical system (CPS) embedded devices/controllers can have vulnerabilities stemming from multiple sources such as weak security practices, outdated libraries, or supply chain attacks that induce adversarial effects under plant state-based triggers. However, pre-deployment validation of CPS controllers typically relies on digital twins that model controller logic as a black box. On the other hand, side channel monitoring and anomaly detection of CPS controller firmware/software is complementary, but is typically exercised with synthetic inputs or under specific CPS operational profiles and does not simultaneously track software execution and CPS plant evolution. To bridge this gap, we present a closed-loop digital twin framework that hosts unmodified controller binaries in a Linux sandbox (SaMOSA) with its I/O rerouted to an external plant simulator. The framework captures four time-synchronized side channels (hardware performance counters, system calls, disk activity, network activity) alongside plant state and provides orchestration hooks for automated, repeatable, parameterized runs. We demonstrate the framework on an OpenPLC runtime controlling a Modbus-connected IEEE 14-bus power system, and also briefly discuss application to robotics systems. The synchronized traces correlate internal controller execution with plant events, providing an observability foundation for online testing, coverage analysis, and vulnerability detection.
comment: 5 Pages, 4 Figures
Robotics
MemoryWAM: Efficient World Action Modeling with Persistent Memory
Robust robotic manipulation in the real world requires not only an understanding of the current observation, but also memory and dynamics modeling. World action models (WAMs) possess these capabilities by jointly modeling visual foresight and actions conditioned on both current and historical observations, making them a promising paradigm for robotic manipulation. However, existing WAMs face a fundamental trade-off: methods with efficient inference typically condition only on a bounded window of recent observations and therefore struggle in non-Markovian environments, whereas methods that preserve long histories incur time and space costs that grow substantially with sequence length. To address this challenge, we introduce MemoryWAM, a world action model with efficient persistent memory. MemoryWAM uses a hybrid memory design that combines recent frames, event-boundary anchor frames, and compact gist tokens that summarize long-range history. A tailored attention mechanism enables retrieval of both detailed short-term context and compressed long-term context, supporting memory-dependent decision-making with reduced inference latency and GPU memory usage. Across long-horizon, memory-dependent manipulation tasks in both simulation and the real world, MemoryWAM outperforms strong vision-language-action (VLA) and WAM baselines while maintaining favorable computational efficiency.
Generating Robot Hands from Human Demonstrations
Robot learning has advanced rapidly in learning control, but learning the physical body of a robot remains much more difficult because jointly searching over design and control creates a very large combinatorial problem. Here, we present a data-driven framework for generating robot hands from human demonstrations. Instead of learning a complex controller together with each candidate design, we generate robot hand designs using the same simple control policy used after fabrication: matching fingertip positions through inverse kinematics. Using more than 4 million frames of human fingertip motion from everyday manipulation, our algorithm optimizes tree-structured robot hands to reproduce desired target motions. The framework produced both a 6-degree-of-freedom (DoF) general-purpose hand and lower-DoF task-specific hands with spatial four-bar mimic joints. To accelerate the search over designs, we trained a reinforcement-learning (RL) actor to propose good hand designs and joint angles, reducing search time from hours to minutes. We fabricated the mechanisms directly as one-piece articulated structures with print-in-place joints. In real-world experiments, the 6-DoF hand achieved highly accurate teleoperated fingertip tracking better than available commercial robot hands, whereas the specialized 3-DoF hands reproduced structured human and synthetic trajectories with reduced mechanical complexity. These results showed that large-scale human motion data can be used not only to train robot controllers but also as a reference for optimizing and generating the physical embodiment of robots.
The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups
We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $ρ(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_λ^2/τ$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.
comment: preprint, 19 pages, 3 figures
Increasing Resilience of Continuum Robots via Motion Planning Algorithms
This paper presents an experimental study of motion planning for resilient continuum robots. In this study we mainly focused on multi-criteria decision-making, its application for path-planning algorithms, impact on the generated path and execution time. To do this, we used two well-known algorithms for path planning, namely Genetic algorithm and A star algorithm, and modified them by adding the Analytical Hierarchy Process algorithm to evaluate the quality of the paths generated. In our experiment the Analytical Hierarchy Process considers four different criteria, i.e. distance, motors damage, mechanical damage of the robot's arm and accuracy, each considered to contribute to the resilience of a continuum robot. The use of different criteria is necessary to increase the time to maintenance operations of the continuum robot. We conducted the experiments using two different simulated environments of the robot. Although we significantly simplified the robot's model and its environment, we still implemented some of the features of the environment based on the real robot prototype. In particular, one of the environments has single- as well as multi-path points, and other consists of the multi-path points only. The results show that, in contrast to A star, the performance time of Genetic algorithm does not depend on the environment's cardinality. It generates more diverse paths, which increases the robot's resilience.
Fast Human Attention Prediction for Fixation-guided Active Perception in Autonomous Navigation IROS 2026
Human visual attention relies on structured scanpaths to efficiently process scenes, yet instilling this behavior into robot autonomy is in its infancy and hindered by the high,computational costs of existing predictive models. To address this, we introduce GazeLNN, a computationally lightweight,scanpath prediction model that leverages Liquid Neural Networks as its recurrent engine and employs MobileNetV3 for feature extraction. Operating auto-regressively, the architecture predicts sequential fixation heatmaps conditioned on the current visual stimulus and fixation history. Despite requiring only 0.61 GFLOPs, GazeLNN achieves state-of-the-art performance on the MIT Low Resolution dataset achieving 0.47 ScanMatch score. It outperforms existing recurrent baselines across diverse evaluation metrics, while reducing computational costs by 99.40% and accelerating inference by up to six times. To investigate the role of human attention modeling in robot autonomy and demonstrate the practical utility of this highly efficient architecture, we integrate GazeLNN into an active camera-robot control policy trained via Reinforcement Learning. This integration enables human-fixation-guided perception during autonomous navigation, validated through successful real-world deployments on an aerial robot.
comment: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
GroundControl: Anticipating Navigation Failures in Vision-Language Agents via Trajectory-Consistent Uncertainty Estimates
Vision-language navigation agents achieve competitive average success on benchmark tasks, yet failures often arise through predictable trajectory-level breakdowns such as oscillation, stagnation, or inefficient detours. Reliable deployment, therefore, requires uncertainty signals that anticipate emerging failure dynamics during execution rather than reflect only instantaneous action entropy. We introduce \emph{GroundControl}, a trajectory-consistent uncertainty estimator defined as statistical deviation from nominal goal-directed distance-to-goal dynamics aggregated over an episode. GroundControl models distance evolution using a constant-velocity Kalman filter and combines normalized innovation statistics with complementary trajectory features capturing progress, monotonicity, path efficiency, and oscillatory behavior. The resulting uncertainty score reflects geometric and temporal inconsistency in navigation behavior rather than local prediction dispersion. To evaluate uncertainty quality independently of task success, we formalize \emph{Selective Risk--Coverage Navigation (SRCN)}, a protocol that measures how effectively an uncertainty score ranks episodes by failure or inefficiency using risk--coverage curves and AURC / E-AURC summaries. Across five EB-Navigation splits ($N=300$ episodes), trajectory-consistent uncertainty achieves near-oracle ordering under success-based selective risk, with weighted-average $\mathrm{E\text{-}AURC}_{\mathrm{SR}}=0.0024$ for the GPT-4o model, substantially outperforming entropy-, conformal-, and heuristic baselines. Under SPL-based selective evaluation, GroundControl consistently achieves the lowest AURC and E-AURC across models and navigation splits. These results show that modeling deviation from goal-directed dynamics provides an interpretable and robust signal for anticipating navigation failures in vision-language agents.
Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation
Learning-based planners for sidewalk navigation can generate diverse candidate trajectories in real time, yet their scoring functions often fail to select the best trajectory in challenging situations, outputting trajectories that make the mobile robot drive onto grass, toward pedestrians, or in the wrong direction, even when better candidates exist in the same set. We call this the trajectory scoring gap: in real-world sidewalk navigation, the gap between an anchor-based planner's top choice and the best possible candidate is substantial, likely due to limited high-level scene understanding capability of the planner. Rather than replacing the planner with an end-to-end Vision-Language-Action model, we propose a VLM-Planner interface that uses a VLM to select a candidate index from the planner's proposal set and then fuse it with the planner's initial output. However, VLMs take 1--3s per query and so cannot directly drive a 5--20Hz control loop. We contribute a training-free, latency-resilient trajectory-level fusion layer that turns a stale VLM selection into real-time planner scoring via geometric similarity with exponential decay. On $\sim$2,000 challenging real-world scenarios (e.g., junctions, pedestrian encounters), VLM selection achieves 30% ADE reduction versus the planner's best selection, while the planner remains competitive in routine situations. In simulation, Score Fusion maintains >80% success rate with delays up to 5s. We demonstrate the full system on a mobile robot navigating challenging campus sidewalks with varied network latency.
ARC: Adaptive Robust Joint State and Covariance Estimation
Sensor measurements are frequently corrupted by outliers and non-Gaussian noise. These imperfections in the sensor data can cause classical state estimators to generate biased and unreliable state and uncertainty estimates. Robust estimators reject or downweight outliers but do not perform measurement covariance estimation, whereas joint state and covariance estimators assume Gaussian residuals and fixed loss shape parameters. Integrating these two capabilities into a single framework is an opportunity to simultaneously estimate both state and covariance in the presence of outliers. This paper proposes a unified Block-Coordinate Descent framework that combines a norm-aware adaptive robust loss, an Iteratively Reweighted Least-Squares state update, and a Minimum Weighted Covariance Determinant covariance estimator, yielding a self-tuning joint state and covariance estimator. The framework is evaluated in a Monte-Carlo simulation and on real-world ultra-wideband localization experiments in cluttered non-line-of-sight environments. Results show that the proposed estimator consistently recovers the true inlier measurement covariance and matches or exceeds the state estimation accuracy of all baselines, without requiring any manual parameter tuning.
comment: Submitted to information IEEE Robotics and Automation Letters (RA-L), June 2026. 8 pages, 7 figures, 1 table
TaCauchy: An Extensible FEM Framework for Vision-Based Tactile Simulation IROS
Vision-based tactile sensors require high-fidelity simulation for reinforcement learning, yet existing approaches struggle to provide accurate mechanical stress fields within GPU-accelerated robotics platforms. We present TaCauchy, an extensible Finite Element Method (FEM) framework that integrates rigorous physics-based force computation into Isaac Sim. Built on the Unified Incremental Potential Contact (UIPC) solver, TaCauchy directly computes Cauchy stress tensors from hyperelastic constitutive laws and projects them onto contact surfaces to obtain traction forces and pressure distributions, providing mechanical ground truth from first principles rather than empirical estimation. Our framework features automatic mesh generation with geometry-aware adaptive refinement and a modular sensor interface enabling rapid integration of diverse sensors (GelSight Mini, DIGIT, 9DTact) with minimal configuration. Performance benchmarks demonstrate 33.40 FPS for single environments and 555 FPS aggregate throughput across 60 parallel environments, with stress extraction overhead under 1 ms. Physical validation experiments show strong agreement between simulated and real tactile responses across force ranges from 1.2556 N to 4.7332 N, achieving SSIM above 0.93, confirming the framework's capability to provide accurate, physically-grounded force supervision for downstream robotic manipulation tasks.
comment: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2026
LIT-GS: LiDAR-Inertial-Thermal Gaussian Splatting for Illumination-Robust Mapping IROS 2026
Gaussian Splatting has enabled real-time neural rendering, yet existing LiDAR-inertial-visual (LIV) Gaussian mapping pipelines remain fragile under illumination changes and texture-deficient scenes due to their reliance on RGB photometric cues. We present LIT-GS, a LiDAR-inertial-thermal Gaussian Splatting framework that injects LiDAR-derived plane geometry as an explicit constraint in both pose/structure refinement and Gaussian optimization. Specifically, we exploit LIV visual map points as confidence-aware cross-modal anchors to establish reliable thermal-LiDAR associations, and incorporate weighted LiDAR point-to-plane residuals into bundle adjustment to jointly refine camera poses and 3D points under weak thermal supervision. Building on the refined structure, we further introduce a LiDAR-plane-regularized differentiable splatting objective that constrains rendered 3D points to align with locally observed planes, mitigating surface thickening and structural drift in low-contrast thermal imagery. Experiments on proprietary sequences and public datasets demonstrate that LIT-GS consistently improves geometric accuracy and rendering quality over state-of-the-art LIV-based Gaussian Splatting baselines, particularly in challenging lighting conditions.
comment: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
Agentic AutoResearch forSpace Autonomy: An Auditable, LLM-Driven Research Agent for Aerospace Control Problems
Spacecraft guidance, navigation, and control functions are increasingly realized as learned policies distilled from expert solvers. Developing such a policy is itself a research process: an investigator selects an architecture and hyperparameters, runs experiments, and must determine whether an apparent improvement is genuine or merely seed noise. This paper presents AutoResearch, a framework in which a large language model autonomously drives that loop for aerospace control problems, coupled with a credibility layer, built into the loop, that certifies each reported result against the problem's own measured seed noise. The language model serves only as the offline research agent that develops the control policy; the trained policy it produces is then deployed onboard the spacecraft, while the model itself never operates the vehicle. At each iteration the agent reads a plain-language problem description and the run history, proposes a single edit to the training script, executes it, and logs the outcome. No reported result is credited until it passes the same three checks: measured per-problem seed noise, reseeded verification of the best configuration, and leave-one-out pruning of the agent's edits. The same loop is applied, unchanged, to two aerospace control problems: a Clohessy-Wiltshire relative rendezvous and a safety-constrained collision-avoidance docking past a keep-out zone, each calibrated against a known optimal control benchmark. In both, the audited policy clears the measured seed noise by many standard deviations; an undirected search over the same parameters does not. On the docking problem the gap becomes categorical: undirected search yields no feasible policy, while the learned policy stays outside the keep-out zone on every seed.
CoLI: A Reproducible Platform for Continuum Robot Learning via Monolithic 3D Printing and Isomorphic Teleoperation IROS2026
Continuum robots offer strong potential for manipulation tasks due to their high degrees of freedom, compliant structures, and operational safety. However, their adoption in both research and practical applications has been hindered by reproducibility issues arising from complex fabrication and assembly processes, challenging kinematic modeling, and a lack of intuitive control interfaces. To address these challenges, we present a novel open-source continuum robot design. The platform features a simplified fabrication pipeline enabled by multi-material 3D printing, allowing the arm to be fabricated as a monolithic compliant structure with minimal assembly. Control is achieved through an isomorphic teleoperation interface that establishes a direct actuator-level mapping, eliminating the need for explicit kinematic modeling and providing a singularity-free mapping. Building on this hardware design, the platform further supports imitation-learning-based autonomous control. The proposed system is evaluated through hardware characterization and a set of manipulation tasks. Experimental results demonstrate that the platform provides a reproducible, learning-ready continuum robot system, accelerating algorithmic development and systematic benchmarking for the continuum robotics community.
comment: 8 pages, 7 figures, 1 table, accepted by IROS2026
An Infrastructure-less, Control-Independent Solution to Relative Localisation of a Team of Mobile Robots using Ranging Measurements
The ability to localise teams of robots is essential for applications ranging from robotic fleets in unstructured environments to cooperative control and navigation tasks. In such contexts, fixed infrastructure is often unavailable, deployments must be fast and flexible, and system requirements must be minimal. We present a decentralised cooperative localisation algorithm that addresses all these challenges at once. The method is anchor-less, fully decentralised, and, unlike most existing approaches, does not require controlling the robots motion to ensure team observability. It relies only on local odometry, sparse inter-agent ranging measurements, and short-range communication, all of which are widely available in practice. The algorithm adopts a multi-hypothesis Bayesian framework that maintains the entire set of feasible solutions, ensuring robustness under transient unobservable conditions. Moreover, through information sharing, each agent benefits from the estimates of the entire group, even in partially connected conditions.
Autonomous Driving with Priority-Ordered STL Specifications Under Multimodal Uncertainty
Autonomous vehicles must plan trajectories that satisfy a multitude of requirements on safety, passenger comfort, and compliance with traffic rules. However, in safety-critical scenarios, it is not always possible to satisfy all requirements simultaneously, necessitating their prioritization based on importance. At the same time, in these safety-critical scenarios, the uncertainty in trajectory predictions of the surrounding traffic, such as other vehicles and pedestrians, should be explicitly accounted for. In this work, we propose an uncertainty-aware trajectory planning framework that incorporates a predefined lexicographic ordering over Signal Temporal Logic (STL) specifications that stays valid under uncertainty. We implement this formulation with Model Predictive Path Integral (MPPI) control and we demonstrate the effectiveness of our method on simulation scenarios, showing that our framework efficiently handles conflicting objectives under realistic multi-modal uncertainty.
Towards 3D karst underwater scene reconstruction from rotating sonar data
Karst aquifers provide critical freshwater resources but pose significant hazards due to their complex and poorly understood subsurface geometry. Mapping these environments is challenging because sonar data from underwater exploration is sparse and noisy, while navigation estimates suffer from drift limiting standard 3D reconstruction methods. We present a pipeline for reconstructing underwater karst conduits from a sonar profiler. We combine a continuous-time SLAM approach to correct trajectory drift with a novel two-stage deep learning method for surface reconstruction, producing an immersive and navigable 3D mesh for hydrogeological analysis.
comment: 1st Workshop on Long-term Deployments in the Wild (LoWi)
Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems
Vision-language-action (VLA) models show strong capabilities in single and dual-arm robotic manipulation. Prior works show coordinated bimanual behaviors can emerge from end-to-end learning, leveraging large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework introducing explicit structural priors into VLA models. We instantiate our method on a state-of-the-art vision-language backbone by replacing its monolithic action head with a Structured Action Expert (SAE) designed for bimanual coordination. Specifically, we introduce explicit structure at the action generation level with a modular coordination-aware loss that shapes shared and residual latents according to task-specific structures. The shared latent encodes task-level coordination intent, while residual latents capture execution adjustments for each arm. At deployment, a Latent-Aware Controller (LAC) interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. LAC operates at the joint-command level and remains compatible with standard control pipelines without requiring force or impedance control. Experiments across simulation and real-world benchmarks show Co-VLA significantly outperforms monolithic baselines, achieving a 27% success rate gain in tight-coordination tasks, more than doubling performance in OOD real-world scenarios (from 13% to 27%), and reducing task completion time by up to 25%.
Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications
AI vision models are a driving factor for the potential use case scenarios of cognitive robotics within in the industry and household applications. A large array of methods from semantic environment analysis towards 6D and grasping pose estimation have been proposed based on the latest AI achievements. However, such advancements require further strong and efficient methods w.r.t. training data and AI-architectures, which are capable in synergy to tackle current challenges, precision limits, and scalability beyond domain gaps. In this paper, we discuss these current limits and trends in the related state-of-the-art which are challenging those. Further we discuss our current work in progress on bridging the domain gap between simulations and real world applications by linking those in the training data generation.
comment: Accepted and best paper award at MHI-Kolloquium 2024
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think
Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy. To exploit this, we introduce a structural compression pipeline that is entirely training-free, bypassing the need of existing methods to load full-scale models to learn optimized token reductions or dynamic layer selectors. Instead, using only a single forward pass via Centered Kernel Alignment to identify redundant layer features, we remove twin layers to permanently compress the model depth by up to 50% across both the VLM backbone and the continuous control policy head. Downstream fine-tuning of this streamlined architecture yields a dual acceleration benefit: a 40-50% reduction in training time and up to 30% faster real-time inference, while matching or exceeding full-scale base model performance. We comprehensively validate our method across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 diverse real-world manipulation tasks across 4 unique robotic embodiments. These results prove that advanced VLAs require significantly fewer layers than previously assumed, offering a highly compute-efficient paradigm for scalable robot learning.
Mobile Target Search with Imperfect Perception: A Partially Observable Stochastic Game Theoretical Approach
This paper investigates mobile target search under imperfect perceptions caused by sensor limitations, malicious jamming, or communication noise. Searchers and targets operate in a grid-shaped area with bounded mobility, leading to a dynamic interplay between search and evasion. To capture this adversarial interaction under imperfect perceptions, we adopt the partially observable stochastic game (POSG) approach, which generalizes partially observable Markov decision processes (POMDPs) by incorporating target intelligence. To handle false alarms and missed detections caused by perceptual uncertainties, we propose a novel detectability concept to determine whether a search strategy guarantees eventual detection, and provide sufficient detectability criteria based on stochastic recurrence analysis. We further develop a server-assisted distributed algorithm that utilizes the aggregative potential game structure for searchers and a KL-divergence-based reduction for target prediction. Numerical simulations validate the effectiveness of the proposed algorithm and support the detectability analysis.
FlowMaps: Modeling Long-Term Multimodal Object Dynamics with Flow Matching
Joint spatial and temporal understanding of 3D scenes is a crucial requirement for robots deployed in everyday household environments. Such agents must not only comprehend and navigate spatial layouts, but also reason about how these spaces evolve over time. In particular, humans interact with objects daily, causing them to change position throughout the environment and making it difficult for robots to reliably associate current observations with previously seen objects. However, these interactions are not random: human habits and routines induce spatio-temporally consistent patterns in object locations, which robotic agents can potentially learn and then exploit for downstream tasks such as navigation. To this end, we introduce FlowMaps, a latent flow matching model for estimating multimodal distributions over the future locations of dynamic objects in a continuous 3D space. By learning the implicit dependencies among objects and their temporal evolution, FlowMaps predicts likely changes in object locations conditioned on past human interactions, while supporting generalization across previously unseen environments that share similar object routines. To demonstrate the utility of this method, we deploy FlowMaps in a downstream dynamic Object Navigation task in both simulated and real-world environments. Across more than 600 episodes, FlowMaps outperforms state-of-the-art approaches, showing that modeling object dynamics through continuous, multimodal spatio-temporal distributions improves robotic search and navigation in changing household environments. Code and additional material is available at https://fra-tsuna.github.io/flowmaps/.
Stable Transformer-Actor-Critic Model Predictive Control: A Contraction Analysis Approach
Actor-Critic Model Predictive Control (MPC) effectively addresses complex, non-convex control problems, but guaranteeing the closed-loop stability of sequence-based learning models within these pipelines remains challenging. This paper introduces a novel Transformer-Actor-Critic MPC architecture with formal robustness guarantees. First, we prove that Transformer networks can satisfy global incremental Input-to-State Stability ($δ$ISS). We then leverage Riemannian contraction theory to analyze the interconnected dynamics between the physical plant and the predictive neural network. Finally, we integrate these theoretical bounds as a training regularizer to yield a certifiably robust policy. The framework is validated on a nonlinear 3D drone model executing target-reaching and obstacle-avoidance maneuvers.
Belt-Finger: An Affordable Soft Belt-Driven Gripper for Dexterous In-Hand Manipulation
Parallel-jaw grippers are the default manipulator choice in robotics because they are simple, robust, and inexpensive. Their limited in-hand mobility, however, often forces large arm motions and restricts dexterous manipulation in confined workspaces. We present a parallel-gripper upgrade: a double-soft-belt-based finger module that preserves standard opening/closing while adding three in-hand degrees of freedom (DoF): translation, pitch, and roll. The mechanism is deliberately kept simple and engineered for inexpensive manufacturing and straightforward integration, preserving the reliability and precise control of traditional parallel grippers while greatly broadening the range of manipulation capabilities. To demonstrate the utility of the added DoFs, we integrate the gripper in two control pipelines. First, we adapt a model predictive controller for in-hand manipulation of known objects. Second, we introduce a lightweight teleoperation interface that enables simultaneous control of the robot arm and gripper (10 DoFs total) with minimal hardware. Across a suite of challenging manipulation tasks executed via teleoperation, MPC, and trained policies, the proposed gripper consistently improves dexterity and task feasibility compared to a conventional parallel gripper
HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-trainin ECCV 2026
Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.
comment: Accepted to ECCV 2026. Maciej and Jesper contributed equally
Robust Assembly State Reasoning from Action Recognition for Human-Robot Collaboration
Human Action Recognition (HAR) is frequently investigated in Human-Robot Collaboration (HRC) research to understand what actions have been performed and hence the state of a collaborative task. Accurately tracking an assembly state from HAR is however not fully investigated, and in realistic scenarios is not a trivial task. This research systematically investigates and compares methods for tracking assembly state using action recognition inputs. Investigations using two diverse datasets and five state tracking approaches, including logic-based, Hidden Markov Model (HMM), and neural network (NN) methods, show that optimal approaches are not uniform across different tasks and that different methods fail under different circumstances. Testing is performed using both simulated inputs with varying noise levels and realistic inputs from a HAR model. Results show NN and HMM methods can perform well in tasks with limited variability, but for other scenarios logic-based approaches can be more robust. Methods which model expected action duration are also important for tasks with repeated actions where no additional sensing is provided.
comment: Preprint accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026). 8 pages, 9 figures, 3 tables
Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation
Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency input, we transform discrete action sequences into the frequency domain with the discrete cosine transform (DCT), perform flow matching over the resulting coefficients, and reconstruct continuous actions via cosine basis expansion. To generate temporally consistent actions, we regularize the first-order temporal derivative to promote smooth actions. This corresponds to a Sobolev-type constraint that suppresses high-frequency errors and discourages abrupt action changes. Our FAFM is simple, introduces no additional network parameters and applies to standalone flow-matching policies and vision-language action models. Across synthetic toy benchmark, obstacle avoidance, LapGym, and LIBERO, FAFM improves success rates, multimodal expressivity, motion smoothness, convergence speed, robustness to mechanical bias and mixed-frequency input. These gains are consistent when deployed on a real-world Franka robot. Code available at https://anonymous.4open.science/r/FAFM.
Dual-Agent Framework for Cross-Model Verified Translation of Natural-Language Protocols into Robotic Laboratory Platform
Biological experiment protocols are written in natural language, whereas automation systems rely on predefined control commands, creating a semantic gap that limits autonomous execution. Microplate-based automatic experiments are particularly challenging due to the need to simultaneously control well mapping, sample-reagent combinations, replicate placement, and parallel dispensing. This study proposes an agent-based protocol translation framework that converts natural-language microplate-based protocols into executable control commands for a robotic laboratory platform. A Parser Agent formalizes the natural-language protocol into a structured representation, and a rule-based mapping engine deterministically incorporates the operational constraints of the robotic laboratory platform to generate device-level control commands. A heterogeneous LLM Validation Agent verifies completeness, parameter accuracy, and execution order, and triggers a self-correction loop with structured feedback when errors are detected. A sweep involving 7 Parsers and 3 Validators on randomly selected ELISA protocols evaluates how model scale and Validator type affect translation accuracy and pass rates under cross-model verification. The accuracy-latency trade-off is further verified by comparing the rule-based mapping of the proposed framework with LLM end-to-end direct mapping. Finally, Bradford assay-based protein quantification using a microplate was demonstrated on a robotic laboratory platform, validating end-to-end autonomous execution from natural-language protocols to real-world experiments. The proposed framework provides a flexible approach to narrowing the semantic gap between natural-language protocols and microplate-based self-driving laboratories.
Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation
Vision-language-action (VLA) policies have shown strong potential for general-purpose manipulation, yet they often fail on novel, out-of-distribution objects whose appearance or geometry deviates from the training distribution. The standard remedy is to collect multi-view teleoperation data for every failure case, but this scales poorly in both cost and time. We introduce Pose6DAug, a failure-driven data augmentation framework that turns a policy's own successful episodes into targeted demonstrations for its failure modes, without any new data collection. Our key insight is that each successful episode already encodes a physically valid action trajectory together with calibrated multi-view observations. By swapping only the manipulated object while preserving this trajectory, we obtain new and physically grounded demonstrations. However, naive 2D video editing breaks multi-view consistency and physical plausibility, particularly under heavy occlusion and egocentric viewpoints. Our method instead operates directly in 3D, anchoring the target object with an explicit mesh driven by a temporally coherent 6D pose trajectory, ensuring geometrically consistent renderings across all camera views. Fine-tuning a VLA on data augmented by our method improves success rates by 16.5% relative to the state-of-the-art baseline on novel objects, while preserving in-distribution performance. These results show that multi-view and physically consistent augmentation is a practical path to scalable VLA generalization.
VFILC: Accurate Frequency Extrapolations in Imitation Learning via Sampling Frequency ILC IROS 2026
Conventional neural network (NN)-based imitation learning methods for variable-speed motion either restricted their scope to interpolated speeds, or generated unpredictable motions when extrapolating beyond trained velocity ranges. Variable-frequency imitation learning (VFIL) enabled extrapolations of speeds by linking the NN model's sampling frequency to the motion frequency, whereas its open-loop configuration caused frequency errors, especially in the extrapolated high-frequency settings. This study proposes variable-frequency imitation learning with iterative learning control (VFILC) based on a combination of VFIL and iterative learning control (ILC) with both feedforward and feedback parts, the former taking advantage of VFIL and the latter adjusting the frequency errors. The experimental results showed that the proposed method successfully and accurately extrapolated motion speeds and reduced frequency errors in all three tasks, and that the feedback especially reduced the frequency errors by a remarkable 81% in the wiping task and 50% in the shaking task, both compared to simple feedforward VFIL, when extrapolating at double the average speed in the training data. The proposed method also improved accuracy by 27% compared with VFIL even at an interpolated frequency for a contact-rich mixing task affected by complex friction traits.
comment: 8 pages, 17 figures. Accepted at IROS 2026
MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs
Image-based behaviour cloning leverages demonstrations captured from ubiquitous RGB cameras. However, it remains constrained by the cost of collecting diverse demos, especially for generalizing across workspace variations. We propose MirrorDuo, a reflection-based formulation that operates on image, proprioception, and full 6-DoF end-effector action tuples, generating a mirrored counterpart for each original demonstration, effectively achieving "collect one, get one for free". It can be applied as a data augmentation strategy for existing learning pipelines, such as standard behaviour cloning or diffusion policy, or as a structural prior for reflection-equivariant policy networks. By leveraging the overlap between the original and mirrored domains, MirrorDuo achieves significantly improved performance under the same data budget when demonstrations are evenly distributed across both sides of the workspace. When demonstrations are confined to one side, MirrorDuo enables efficient skill transfer to the mirrored workspace with as few as zero or five demos in the target arrangement.
comment: Published in CoRL 2025
A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems
Dynamic environmental changes, confined workspaces, and stringent real-time constraints make pathfinding in Robotic Mobile Fulfillment Systems (RMFS) a challenging problem for conventional search- and rule-based methods, which typically suffer from high computational complexity and long decision latency. While reinforcement learning (RL) has emerged as a powerful alternative, deploying learned policies with extreme energy efficiency on resource-constrained hardware remains an open challenge. We present SDQN-RMFS, an end-to-end framework that achieves high-fidelity deployment of an RL-trained policy from a full-precision artificial neural network (ANN) through to a neuromorphic chip. By computing only when triggered by sparse events, this framework unlocks ultra-low-power RMFS pathfinding. Our full-stack pipeline operates as follows: an ANN policy is first efficiently trained via a collision-allowing strategy to densify informative trajectories, and then converted into a spiking neural network (SNN) via a hard-label knowledge distillation approach. This effectively addresses the output distribution mismatch, preserving policy capability across the ANN-to-SNN pipeline while substantially reducing inference latency. Hardware experiments demonstrate up to 11,281$\times$ energy savings and a nearly two-fold reduction in latency compared to a high-performance GPU baseline, while maintaining decision quality on par with the original trained policy. These results establish physical neuromorphic inference as a practical and energy-sustainable pathway for large-scale RMFS operations.
Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory
Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. Moreover, Tri-Info transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83\% accuracy on real-world tasks where prior detectors collapse to chance. This establishes Tri-Info as a simple yet powerful method that not only detects failures with strong cross-domain generalization, but also delivers interpretable diagnostics of the underlying failure modes.
Evaluation of Augmented Reality-based Intuitive Interface for Robot-Assisted Transesophageal Echocardiography: A User Study
TransEsophageal Echocardiography (TEE) is essential for diagnosing and guiding Structural Heart Disease (SHD) interventions. However, manual TEE manipulation demands significant operator expertise, is physically demanding, and exposes clinicians to radiation when performed alongside fluoroscopy. Robotic-assisted TEE systems have been introduced to improve probe handling and reduce operator fatigue, yet the design of intuitive and effective user interfaces remains an open challenge. This study presents and evaluates a model-enhanced, Augmented Reality (AR)-based intuitive interface for robot-assisted TEE, designed to improve spatial awareness and control intuitiveness. A robotic TEE platform integrated with electromagnetic tracking and a virtual simulator was used to compare three user interfaces differing in visualization and interaction modalities: 2D jointlevel (2D-JI), 3D joint-level (3D-JI), and 3D tip-level (3D-TI). Thirty six participants performed standardized navigation tasks to reproduce target echocardiographic views, with performance assessed via position and orientation errors, completion time, and NASA-TLX workload scores. Results show that 3D visualization significantly improved spatial accuracy, reducing median position error from 13 mm to 3 mm and halving the orientation error compared with the 2D interface. Tip-level interaction yielded a further 50% reduction in orientation error and reduced interuser variability relative to joint-level control. Overall, the 3D-TI configuration, combining immersive visualization with direct tip-level control, proved the most effective and ergonomic interface, supporting the integration of AR-based visualization and intuitive control paradigms into next-generation robotic TEE systems to enhance operator performance and procedural safety.
Motor Angular Speed Preintegration for Multirotor UAV State Estimation
A precise state estimate is crucial for a tight feedback control that enables agile and near-obstacle flights of UAVs. The state-of-the-art methods fuse slow pose measurements with high-frequency inertial measurements to obtain a precise state estimate. However, the inertial measurements from the IMU onboard the UAV are degraded by vibrations from spinning propellers and the precision of the estimated state suffers. We propose a novel approach based on the preintegration of accelerations obtained from motor speeds. We show that the accelerations obtained in this manner can be used for state propagation on their own to achieve better precision without including the IMU. Further, we propose a factor composed of the preintegrated motor speeds that can be directly employed in factor graph optimization frameworks. We combine our factor with LiDAR measurements into the proposed Motor Angular Speed LiDAR Odometry (MAS-LO) algorithm for precise state estimation, which we open-source. Lastly, we evaluate the estimation precision against a state-of-the-art inertial algorithm LIO-SAM to show 28% improvement in position and 65% in velocity estimation accuracy, 14% lower measurement lag, and high robustness to wrong parameter values.
SWAP: Symmetric Equivariant World-Model for Agile Robot Parkour
While latent world models enable the proactive predictions required for extreme parkour, their purely data-driven nature forces them to redundantly encode left-right symmetric interactions as independent patterns. This inflates the learning burden and hinders the capture of geometric regularities, restricting the latent space's efficiency for downstream policies. To address this, we propose SWAP, an end-to-end equivariant symmetric world model. This framework embeds symmetry directly into both the world model and the actor-critic networks. In real-world tests, the robot leaps across a 2.13 m gap and climbs a 1.63 m platform, breaking records for quadruped parkour. Furthermore, the framework exhibits robust geometric generalization to unseen mirrored terrains and exceptional zero-shot transferability across diverse outdoor environments. These results demonstrate that symmetry equivariance is an effective structural prior for pushing the physical boundaries of learned legged locomotion.
Deep-Unfolded Coordination
Distributed optimization is a highly scalable and structurally transparent technique to solve multi-agent robotics problems; however, such methods often suffer from the need for highly-specialized, problem-specific hyperparameter tunings. In this work, we propose Deep Coordinator, a deep-unfolding framework that learns to dynamically adjust the hyperparameters of ADMM-DDP, a popular distributed solver for robotics tasks, at solve-time in response to optimizer performance. Our architecture consists of unrolling a fixed number of ADMM-DDP iterations into a neural network with learnable functions between layers mapping the optimizer state to the next hyperparameters. To the best of our knowledge, Deep Coordinator is the first deep-unfolding framework to adapt the penalty parameters of a non-convex optimizer at solve-time; we show that the mainstream supervised approach can yield degenerate solutions when training such models, and propose an unsupervised learning scheme. On simulations with fleets of cars and quadrotors, Deep Coordinator produces trajectories of comparable quality 6.18-9.44x faster than conventional solvers. Furthermore, Deep Coordinator retains its performance benefits when deployed to systems up to 8x larger than trained on.
comment: The second and third authors contributed equally (equal second authorship). 35 pages (10 pages main text), 17 figures, 3 tables
Co-policy: Responsive Human-Robot Co-Creation for Musical Performances
Art has long stood as a pivotal expression of human creativity. Embodied artificial intelligence offers a route for generative models to participate in that creativity through physical action rather than disembodied digital content. In robotic music co-creation, it is challenging to connect semantic musical understanding with real-time and physically executable performance. We present Co-policy, a framework for human-robot musical co-creation that separates semantic intent grounding, constrained musical variation, and visuomotor execution. To ground musical semantics, Co-policy uses pre-inference semantic anchors and a fine-tuned Qwen-vl planner (F-Qwen) to transform speech, live musical seeds, and visual observations into structured co-creation plans. To support low-latency execution, Co-policy introduces a Gaussian-Mixture Visuomotor Policy (GMP), implemented as a conditional mixture-density policy that maps target notes and visual context to multimodal robot actions in a single forward pass. Unlike robotic playback systems that merely reproduce user-specified notes, Co-policy generates complementary musical responses under both musical and physical constraints. Real-robot chime experiments, ablations, and expert evaluation show improved intent alignment, execution accuracy, and response frequency over diffusion-policy and ablated baselines, supporting physically grounded action generation as a key requirement for embodied human-AI co-creation.
One-to-Two Acting: A Novel Framework for Single-arm Agent Action Expansion to Dual Arms
Dual-arm manipulation can improve throughput via parallel execution, but collecting bimanual demonstrations for training is costly and difficult. We present ExS2D, a hierarchical action expansion framework that enables dual-arm manipulation from single-arm supervision. ExS2D first generates structured subtasks from textual instructions while explicitly capturing temporal precedence. It then grounds each subtask into executable actions through subtask-guided action mapping in observation. Finally, precedence-aware action allocation and synchronized planning are performed by a multimodal large language model driven coordinator to select collision-free dual-arm executions. Simulation experiments demonstrate that ExS2D reduces the average execution steps by 54.4% while maintaining a comparable success rate to a single-arm baseline. Real-robot experiments on four tasks further demonstrate the reliability of ExS2D for dual-arm execution under few-shot single-arm samples, while using zero bimanual demonstrations.
comment: 6 pages, 5 figures, 3 tables
MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM ICRA 2026
3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. e.g., our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica, compared with MonoGS.
comment: ICRA 2026
World Engine: Towards the Era of Post-Training for Autonomous Driving
Autonomous vehicles must operate safely in the real world, where errors can have severe consequences. Although modern end-to-end driving policies excel in routine scenarios, their reliability is limited by the scarcity of safety-critical ``long-tail'' events in real driving datasets. These rare interactions define the practical safety boundary of the learned policy, yet they are difficult to collect at scale in the real world. Here we show that this fundamental limitation can be addressed by post-training pre-trained driving models on synthesized high-stakes interactions. We introduce World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and systematically extrapolates them into realistic safety-critical variations. This paradigm enables reinforcement-based post-training to align policies with safety constraints, circumventing the physical risks inherent in real-world exploration. On a public benchmark built on nuPlan, World Engine substantially reduces failures in rare safety-critical scenarios and yields significantly larger gains than scaling pre-training data alone. Furthermore, when deployed on a production-scale autonomous driving system, the resulting policy reduces simulated collisions and demonstrates measurable improvements in on-road testing, showing that post-training on synthesized, safety-critical interactions offers a scalable and effective pathway to safer autonomous driving. The full codebase suite, including training, is released to the public.
comment: Technical Report. Project Page: https://opendrivelab.com/WorldEngine/
TIDY: Thermal Infrared Image Denoising via Wavelet Domain Entropy and Directional Stripe Index
Thermal infrared (TIR) imaging has been a popular choice for field robotics due to its robust perception capability under low light visual degradation, but it suffers from severe stochastic and fixed-pattern noise that breaks downstream estimation. This noise is intensified indoors due to low thermal contrast and uniform temperature distributions, contributing to the relative lack of indoor TIR deployments. Existing TIR denoising methods exhibit a poor accuracy-efficiency tradeoff, either too slow for online deployment required in robotics or insufficiently robust to severe degradation, while typically being trained on synthetic noise. Addressing these problems, we propose TIDY, a lightweight wavelet-domain denoiser trained on real clean-noisy TIR data. By reformulating TIR denoising in the wavelet domain, TIDY explicitly disentangles noise from structural content, enabling targeted suppression with reduced spatial complexity, significantly improving inference speed over prior methods (~34Hz). TIDY introduces two new metrics, Wavelet Entropy and Wavelet Directional Stripe Index, as complementary loss terms to explicitly suppress stochastic noise and stripe artifacts. Across severe indoor corruption and zero-shot settings, TIDY improves robustness and yields consistent gains in downstream robotics tasks including thermal inertial odometry and monocular depth estimation. Code and dataset is available at: https://github.com/williamrheeth/TIDY
EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for generalist robot manipulation, yet they lack geometric inductive biases: policies trained at specific orientations require substantially more data to generalize across rotational configurations. We present \textsc{EquiVLA}, the first general framework for end-to-end $\mathrm{SO}(2)$-equivariant VLA models, applicable to any architecture coupling a frozen vision-language backbone with a flow-matching Diffusion Transformer action head. \textsc{EquiVLA} introduces \textsc{EquiPerceptor}, which produces approximately $\mathrm{SO}(2)$-equivariant visual representations from frozen ViT features; and \textsc{EquiActor}, an exactly $\mathrm{SO}(2)$-equivariant flow-matching Diffusion Transformer action head. Together, they establish an approximate $\mathrm{SO}(2)$ equivariance chain from camera observations to predicted action sequences. Instantiated on GR00T~N1.5 and evaluated across four LIBERO suites, CALVIN ABCD$\to$D, and five real-robot tasks on Mobile ALOHA, \textsc{EquiVLA} achieves $92.6\%$ average success on LIBERO (vs. $78.1\%$ baseline), an average sequence length of $4.03$ on CALVIN (vs. $3.45$), and improves real-robot success from $54\%$ to $72\%$.
comment: Comment: First version 22 pages, project site: https://equivla.github.io/
Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection
Action chunking enables robot policies to produce temporally coherent behavior, but generating multi-step action sequences with flow-based policies incurs latency that is incompatible with real-time control. Under asynchronous execution, the robot continues executing the current chunk while the next one is generated, causing even minor delays to create inconsistencies at chunk boundaries. Existing methods address this problem by steering generation toward the already executed action prefix. We instead show that prefix consistency can be achieved by selecting an appropriate initial noise before generation begins, allowing the unmodified flow ODE to produce a coherent next chunk. This reframes asynchronous inference as a noise selection problem rather than a trajectory steering problem. We introduce \textbf{PAINT}, a training-free method that finds this noise via backward Euler inversion and constructs the final chunk through a repainting rule. In summary, \texttt{PAINT} requires no gradients, retraining, or policy modification; yet it improves execution consistency and task performance across \textit{12 simulated benchmarks} and \textit{6 real-world manipulation tasks} spanning single-arm, bimanual, and humanoid embodiments. Website: ~\href{https://paint-action-chunking.github.io}{\texttt{https://paint-action-chunking.github.io}}.
comment: First version 19 pages, project site: https://paint-action-chunking.github.io
Data Standards for Humanoid Robotics: The Missing Infrastructure for Physical AI
The scalability of humanoid robots will depend not only on models and hardware, but also on whether physical experience can accumulate across robots, tasks, organizations, and time. Drawing on the authors' work in developing ISO/WD 26264-1, Humanoid robot datasets -- Part 1: General requirements, within ISO/TC 299/WG 16, this article argues that data standards are becoming foundational infrastructure for Physical AI. We develop three insights. First, humanoid robot data is embodied interaction data, not a collection of isolated digital samples; a useful dataset must preserve the relationship among robot body, action, task, scene, execution trace, and outcome. Second, its value depends on physical coherence: multimodal streams are reusable only when timing, coordinate frames, calibration, kinematics, units, and synchronization assumptions remain inspectable. Third, the main bottleneck is not only data scarcity, but non-cumulative data caused by high collection costs, data silos, and inconsistent evaluation. We argue that humanoid robot data standards address these bottlenecks by making embodied experience interpretable, shareable, traceable, and reusable. A general standard should provide horizontal infrastructure for lifecycle management, metadata, provenance, quality, versioning, and traceability, while capability-specific parts should define domain grammar for manipulation, locomotion, human-robot interaction, cognition, and future humanoid capabilities. As AI moves from screens into bodies, data standards must evolve from organizing digital information to structuring physical interaction.
Temporal Self-Imitation Learning
Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.
VOiLA: Vectorized Online Planning with Learned Diffusion Model for POMDP Agents
Planning under uncertainty is an essential capability for autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for such a capability. Although POMDP-based planning has advanced significantly, its application to real-world problems is often limited by the difficulty of obtaining faithful POMDP models. We present Vectorized Online planning wIth Learned diffusion model for POMDP Agents (VOiLA), a framework that learns task-agnostic POMDP models for online planning under uncertainty. VOiLA learns transition and observation samplers using conditional diffusion models and learns observation-likelihood models for particle-based belief updates. To enable efficient online planning, the diffusion samplers are distilled into compact feedforward generators and integrated with Vectorized Online POMDP Planner (VOPP), an online POMDP planner designed to leverage GPU parallelization. Experimental results indicate the distillation strategy reduces sampling cost by up to nearly three orders of magnitude, making learned generative POMDP models practical for online planning. Evaluation of VOiLA on three benchmark problems indicate that VOiLA achieves equal or better performance than Recurrent Soft Actor Critic while using less than 10% training data, and generalizes much better to unseen environment configurations. Physical robot evaluation indicates VOiLA uses the models learned using only simulated data and generates a policy that successfully accomplish the task in 10 of 10 runs.
comment: Submitted to the 2026 International Symposium of Robotics Research (ISRR)
Bidirectional Tutoring for Developmental Motor Learning in Robots: Co-Developed Interaction Dynamics Support Stable Learning
Infants are well known to develop their motor skills through dense interaction with caregivers. Although such social interaction is crucial for human development, motor-skill learning in robots is often treated as a unidirectional process in which robots passively receive demonstrations from tutors. This overlooks a key property of social interaction: it is inherently bidirectional, with tutor and learner dynamically adapting to each other. In such interactions, the robot's past experiences may function as prior constraints that shape the dynamics of their co-developed trajectories. We hypothesize that bidirectional tutoring allows such constraints to guide the formation of consistent behavioral patterns that preserve behavioral coherence and support generalization, whereas unidirectional interaction lacks such constraints and leads to broader, less consistent behavioral patterns. To examine this hypothesis, we conducted two experiments with a physical humanoid robot performing an object manipulation task: one involving human-robot interaction and another employing an AI tutor interacting with the real robot through an adaptive intervention mechanism designed to examine whether similar effects would emerge under more controlled conditions. We implement the developmental learning framework using a free-energy-principle-based neural network extended with generative replay, which supports stable sequence-by-sequence learning from single tutored episodes. Across both settings, bidirectional tutoring fostered consistent behaviors and stage-wise generalization, while the robot gradually required less tutor guidance. These results suggest that bidirectional tutoring, as an embodied and socially grounded approach, provides an effective scaffold for developmental motor learning in robots.
comment: 16 pages, 14 figures
A Differentiable Composite Approximation Framework for Autonomous Underwater Vehicle Maneuvering Modeling from Sea-Trial Data
Field-based modeling from onboard measurements can produce autonomous underwater vehicle (AUV) maneuvering models that reflect real operating characteristics. From an approximation perspective, conventional maneuvering models use predefined constraint polynomial bases, whereas data-driven models use data-adaptive bases. Motivated by this basis-function view, this paper presents a differentiable composite-approximation formulation, in which the polynomial-basis component and the data-adaptive basis component are treated as differentiable parts of a single predictor and calibrated jointly. A gradient-based co-calibration method is developed for full-scale AUV maneuvering prediction, where a sensitivity-aware mechanism regulates bounded polynomial updates while the neural residual captures remaining nonlinear discrepancies under a shared prediction objective. To account for ocean-current effects in field data, a turning-motion-based current estimation and compensation procedure is incorporated to construct current-compensated learning targets for training and rollout. The framework is evaluated using sea-trial data collected from a 7-meter AUV under multiple maneuvering conditions. Results show that the proposed method improves recursive trajectory and velocity prediction compared with polynomial-only, neural-only, and frozen-prior hybrid baselines, demonstrating its applicability to field-data-based AUV maneuvering modeling.
Comparative Study on Agility, Efficiency, and Impact Absorption of Bipedal Robots with Active Toes
Human legs exhibit high efficiency, agility, and impact absorption, with toes playing a crucial role in these capabilities. While many attempts have been made to implement human-like toes in robots, they have not fully replicated human characteristics nor rigorously validated their benefits. We propose a 14-DOF biped robot emulating human toes' lightweight, high-torque, robust nature. To quantitatively analyze the effectiveness of the active toes in terms of agility, efficiency, and impact absorption, we developed a high-fidelity simulation training environment that reflects actual actuators with coupled transmissions and accurate power consumption. To ensure a fair comparison between configurations with and without active toes, we designed a minimal RL reward function and applied an identical training procedure to both. The simulation results indicate that, at 1.33 m/s walking, the toe-equipped robot reduced CoT by 17.5% and heel-strike GRF by 5.0% compared with the toe-ablation configuration. On the agility test, average and maximum path deviation decreased by 25.0% and 34.0%, respectively.
comment: 6 pages, 7 figures
Route-Constrained Robust Fusion Estimation for MEMS/GNSS Integrated Navigation of Unmanned Ground Vehicles in GNSS Degraded Environments ICRA 2026
To address cumulative localization drift of unmanned ground vehicles in structured road environments under severe Global Navigation Satellite System signal occlusion, this paper proposes a robust route-constrained state estimation method. During periods without satellite signals, the proposed method establishes the correspondence between the historical dead reckoning trajectory and local segments of the mission route extracted from a high-definition map, and estimates a route-referenced position via a two-dimensional rigid transformation. The estimated position is then formulated as a pseudo-position observation and incorporated into an Extended Kalman Filter update. In this way, route constraints at the road level can be continuously injected into a unified state estimation framework, thereby suppressing position deviation relative to the mission route while indirectly improving azimuth estimation. To enhance practical applicability, engineering strategies, such as trigger control, matching quality validation, route offset compensation, and single update correction limiting, are further introduced. Experiments in three representative scenarios, including a long tunnel, a multi-segment tunnel, and a curved tunnel, show that the proposed method effectively suppresses error accumulation during satellite outages, reduces the risk of large maximum deviation, and improves localization continuity and road-level usability.
comment: Accepted workshop paper, 1st Workshop on Robot Meets GNSS and Ranging for Seamless Autonomy, IEEE ICRA 2026
ForEnt: A Multi-Modal Dataset for Characterizing Quadruped Robot Entrapments in Forest Environments
Legged robots are increasingly deployed in forests for ecological surveying and monitoring, yet their autonomy is often interrupted consequent to the challenges posed in traversing forest environments. Forest entrapments, for example, when a robot's legs are ensnared in vines or other vegetation, result in loss of stability and toppling. Such events not only disrupt the mission and require manual intervention, but also risk damage to the robot hardware. To address the absence of a dedicated dataset to investigate these failure modes in forest environments, we present ForEnt, a multi-modal dataset collected with the low-cost Unitree Go2 quadruped across eight forest sites in the Southampton Common Woodlands, UK. For our dataset, over approximately 1.7 km of traversals in 11 sequences were conducted, yielding 69 recorded entrapment events. ForEnt includes time-synchronized RGB-D images, LiDAR scans, proprioceptive data, and third-person video, enabling analysis of terrain factors contributing to entrapment and providing labeled sensor streams for reproducible benchmarking. By supporting the evaluation of entrapment detection strategies, ForEnt lowers the barrier to developing robust quadruped robot deployments in challenging forest environments.
comment: 8 pages, 7 figures
Safe Local Navigation for Ackermann-Steered Robots in Unmapped Environments
A control framework is proposed for safe local navigation of mobile robots equipped with Ackermann steering in unmapped environments where a global goal is absent. Based on local obstacle detections, the safest heading angle is determined along the direction of the largest open space ahead of the vehicle. Guided by this direction, bounding lines are constructed on the left and right sides of the vehicle to achieve obstacle separation. These bounding lines are obtained by solving a convex quadratic optimization that maximizes vehicle-to-obstacle clearance. Optionally, conditions are imposed on the bounding lines to preserve parallelism and smooth abrupt changes from prior control steps. A feedback-linearizing controller is then used to regulate the vehicle's distance from one or both bounding lines, effectively enabling tracking of a local reference path that preserves safety through obstacle clearance maximization. Open-source code is included for the application of this control scheme. Experimental results demonstrate that the proposed method produces safer navigation paths with significantly shorter computation times, compared to some existing exploration-based planners.
comment: Presented at the 23rd Conference on Robots and Vision (CRV 2026)
Duet: Dual-Robot Understanding via Efficient Teaching
Dual-robot collaboration enables tasks that exceed the reach and payload of a single robot, such as collaboratively transporting objects across environments and executing coordinated handovers. Data acquisition is the primary bottleneck for training these systems. To this end, we introduce DUET, a dual-robot learning framework for mobile manipulation. For efficient data collection, we create a unified dual-embodiment synchronized VR-based teleoperation system for in-domain heterogeneous robot data collection. We further develop a complementary tracking pipeline that records human-human coordination and collaborative mobile manipulation priors. To allow efficient learning, we introduce an Action Chunking Transformer based architecture that first pretrains collaborative policies on efficient human-human demonstrations, before finetuning them on a minimal set of real-robot teleoperation trajectories. We develop a benchmark of four collaborative tasks to evaluate our framework using a Unitree G1 humanoid and a Dexmate Vega1 mobile manipulator. The results demonstrate that harnessing human priors not only yields superior task performance compared to baselines trained only on robot data, but also reduces the total human effort required for data collection. Our human data collection pipeline achieves 5.4x acceleration on average from teleoperation, but we perform equally or better than robot-only data trained policies across all tasks. Our project page is available at https://zhaoy37.github.io/Duet/.
Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City
As Self-Driving Cars continue to expand internationally and use multi-modal systems such as VLMs as a cognitive backbone for their Action models; how well will these systems generalize in new settings, in particular out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question by providing a full factorial analysis with human drivers of Lima, human drivers from New York City, and VLMs and showing them dashcam footage collected from Lima and New York City -- prompting them with a variety of questions under a Visual Question Answering (VQA) paradigm. In particular, we pick these two cities as they are highly challenging driving locations where no Self-Driving Car company currently operates in, and ask questions that span 4 categories: Factual, Ratings, Counterfactual and Reasoning. We find that Humans and VLMs diverge in their responses -- though this is modulated by the type of questions asked, and that Humans answer similarly independent of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in terms of answers (Humans or VLMs) that was modulated by geography, likely due to their high out-of-distribution nature. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2
comment: 11 pages main body. 42 pages total. Data publicly available online
Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking
Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.
comment: Accepted to IEEE Robotics and Automation Letters (RA-L), 2026
Integrated Exploration-Aware UAV Route Optimization and Path Planning
Uncrewed aerial vehicles (UAVs) are increasingly used for exploration-driven monitoring in hazardous environments such as disaster zones, contaminated sites, wildfire areas, and damaged infrastructure, where limited flight endurance must be allocated between visiting reported locations and gathering new information. In these settings, prior information regarding hazards is often incomplete, spatially imprecise, and subject to change during execution. For example, initial reports may identify a region where a hazard is likely to exist, but the actual hazard may be displaced, partially observed, or entirely unreported. We present an integrated exploration-aware UAV route optimization and path planning framework for hazard monitoring under uncertain and evolving prior information. The environment is represented as a spatial risk map, where each location has an associated belief of hazardous conditions. Reported hazards are modeled as uncertain regions of interest (ROIs) rather than confirmed target locations, requiring the UAV to inspect reported areas while also using its limited flight endurance to explore informative regions. The proposed method solves a vehicle routing problem over reported ROIs, augments the route with auxiliary pseudo-nodes to improve spatial coverage, allocates the remaining flight distance budget across route segments, and optimizes dynamically feasible B-spline trajectories for local exploration. During execution, UAV measurements update a grid-based belief map, and the remaining trajectory is replanned when new information and the remaining budget justify adaptation. Across 48 scenario configurations, online replanning improves average KL reduction by 15.9% over the offline optimized planner and 48.6% over straight-line traversal.
A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation CEC
Reliable state estimation in robotics and control re quires balancing estimation accuracy against computational cost. While filtering-based methods such as the Extended Kalman Filter (EKF) provide efficient real-time updates, and optimisation based formulations using factor graphs improve global consistency, the role of optimisation scheduling is often treated implicitly rather than examined as an explicit design variable. This paper presents an experimental study that explicitly isolates optimisation scheduling using a Smart Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. By combining EKF-based state propagation with periodically invoked batch optimisation and holding solver structure and effort fixed, the main contribution of this work is the experimental characterisation of optimisation scheduling as an independent design variable governing the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly influences pre optimisation drift, transient error behaviour, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimisation can be retained at a fraction of the computational cost, highlighting optimisation scheduling as an under-explored yet critical consideration in hybrid state estimation systems.
comment: This work has been accepted for presentation/publication at the 2026 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). The final published version will appear in IEEE Xplore
PTLD: Sim-to-real Privileged Tactile Latent Distillation for Dexterous Manipulation
Tactile dexterous manipulation is essential to automating complex household tasks, yet learning effective control policies remains a challenge. While recent work has relied on imitation learning, obtaining high quality demonstrations for multi-fingered hands via robot teleoperation or kinesthetic teaching is prohibitive. Alternatively, with reinforcement we can learn skills in simulation, but fast and realistic simulation of tactile observations is challenging. To bridge this gap, we introduce PTLD: sim-to-real Privileged Tactile Latent Distillation, a novel approach to learning tactile manipulation skills without requiring tactile simulation. Instead of simulating tactile sensors or relying purely on proprioceptive policies to transfer zero-shot sim-to-real, our key idea is to leverage privileged sensors in the real world to collect real-world tactile policy data. This data is then used to distill a robust state estimator that operates on tactile input. We demonstrate from our experiments that PTLD can be used to improve proprioceptive manipulation policies trained in simulation significantly by incorporating tactile sensing. On the benchmark in-hand rotation task, PTLD achieves a 182% improvement over a proprioception only policy. We also show that PTLD enables learning the challenging task of tactile in-hand reorientation where we see a 57% improvement in the number of goals reached over using proprioception alone. Website: https://akashsharma02.github.io/ptld-website/.
Immersive and Wearable Thermal Rendering for Augmented Reality
We present a proof-of-concept palm-mounted thermal feedback prototype addressing thermal rendering challenges specific to augmented reality (AR), where users must interact with both real and virtual objects in their physical workspace. In contrast to thermal feedback systems developed for virtual reality, AR thermal feedback must preserve manual dexterity, maintain access to real-world thermal cues, and provide coherent virtual temperature sensations without obstructing natural object interaction. We propose three AR-specific design considerations, which our prototype implements: indirect feedback to preserve fingertip dexterity, active thermal passthrough to sense and render the temperature of contacted physical surfaces, and spatially and temporally varying thermal rendering across the palm. Human-subject experiments evaluated perceptual sensitivity, indirect feedback, active thermal passthrough, spatial pattern recognition, and moving thermal rendering during AR interaction. Results showed that although indirect feedback reduced perceived realism during visual contact at the fingertips, it did not reduce immersion or comfort; active thermal passthrough supported temperature discrimination between real and rendered surfaces; and spatiotemporal rendering significantly improved immersion and realism compared with static thermal stimulation. These findings suggest that our design considerations are viable design strategies for AR thermal haptics, while also clarifying tradeoffs for applications that require precise realism versus broader immersive thermal experience.
Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System
Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses it through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.
VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation IROS 2025
The acoustic response of an object can reveal a lot about its global state, for example its material properties or the extrinsic contacts it is making with the world. In this work, we build an active acoustic sensing gripper equipped with two piezoelectric fingers: one for generating signals, the other for receiving them. By sending an acoustic vibration from one finger to the other through an object, we gain insight into an object's acoustic properties and contact state. We use this system to classify objects, estimate grasping position, estimate poses of internal structures, and classify the types of extrinsic contacts an object is making with the environment. Using our contact type classification model, we tackle a standard long-horizon manipulation problem: peg insertion. We use a simple simulated transition model based on the performance of our sensor to train an imitation learning policy that is robust to imperfect predictions from the classifier. We finally demonstrate the policy on a UR5 robot with active acoustic sensing as the only feedback. Videos can be found at https://roamlab.github.io/vibecheck .
comment: Published at IROS 2025. 8 pages, 7 figures
TASC: Task-Aware Shared Control for Relational Telemanipulation IROS 2026
We present TASC, a Task-Aware Shared Control framework for relational telemanipulation that infers task-level user intent and provides assistance from motion-only input. To support prehensile relational tasks without predefined templates, TASC constructs an open-vocabulary interaction graph from visual input to represent functional object relationships, and infers user intent accordingly. A shared control policy then provides assistance during both grasping and object interaction, guided by spatial constraints predicted by a vision-language model. Our method addresses two key challenges in relational telemanipulation under shared control: (1) task-level intent inference from low-level motion commands, and (2) generalizable assistance across diverse objects and tasks. Experiments in both simulation and the real world demonstrate that TASC improves task efficiency and reduces user input effort compared to prior methods, while enabling zero-shot generalization across diverse relational telemanipulation tasks. The code that supports our experiments is publicly available at https://github.com/fitz0401/tasc.
comment: Accepted to IROS 2026
CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning CVPR 2026
Unsupervised learning of latent motion from Internet videos is crucial for robot learning. Existing discrete methods generally mitigate the shortcut learning caused by extracting excessive static backgrounds through vector quantization with a small codebook size. However, they suffer from information loss and struggle to capture more complex and fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and continuous robot action, which hinders the joint learning of a unified policy. We propose CoMo, which aims to learn more precise continuous latent motion from internet-scale videos. CoMo employs an early temporal difference (Td) mechanism to increase the shortcut learning difficulty and explicitly enhance motion cues. Additionally, to ensure latent motion better captures meaningful foregrounds, we further propose a temporal contrastive learning (Tcl) scheme. Specifically, positive pairs are constructed with a small future frame temporal offset, while negative pairs are formed by directly reversing the temporal direction. The proposed Td and Tcl work synergistically and effectively ensure that the latent motion focuses better on the foreground and reinforces motion cues. Critically, CoMo exhibits strong zeroshot generalization, enabling it to generate effective pseudo action labels for unseen videos. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo action labels achieve superior performance with both diffusion and auto-regressive architectures.
comment: CVPR 2026
Safety-Critical LiDAR-Inertial Odometry with On-Manifold Deterministic Protection Level
In safety-critical scenarios, the protection level of the autonomous navigation system is crucial for enabling mobile robots to perform safe tasks. However, existing studies on probabilistic navigation systems for robots usually perform offline accuracy evaluations using limited datasets and assume that the results can be applied to unknown real-world environments. As a result, current autonomous mobile robots often lack protection levels for online safety assessment. To fill this gap, we propose a safety-critical LiDAR-inertial odometry (LIO) that provides deterministic protection levels based on on-manifold deterministic state estimation. By adopting the unknown but bounded assumption, we derive a neat closed-form relationship between point cloud noise and the uncertainty of the estimation from the iterated closest point algorithm. Using this relationship, we design an on-manifold ellipsoidal set-membership filter and implement it within the LIO system. Leveraging the properties of the set-membership filter, our system offers the feasible sets of the estimated locations as the deterministic protection levels, serving as safety references for the robots' downstream autonomous operations. The experimental results show that our system can provide effective deterministic online safety references for diverse robots in various environments.
Reinforcement Twinning for Hybrid Control of Flapping-Wing Drones
Controlling flapping-wing drones requires controllers that handle time-varying, nonlinear, underactuated dynamics from incomplete, noisy sensor data. Recent advances in artificial intelligence (AI), particularly reinforcement learning (RL), have opened new perspectives for addressing such complex control problems through data-driven policy optimization from interaction with the environment. Yet purely data-driven methods are sample-inefficient, demanding extensive, sometimes unsafe exploration, especially without guiding physical models. This motivates hybrid AI-physics frameworks. This article proposes a hybrid model-free/model-based flight-control approach using the reinforcement twinning algorithm. The model-based (MB) component uses an adjoint formulation and an adaptive digital twin continuously identified from live trajectories; the model-free (MF) component uses RL. The two agents share knowledge via transfer learning, imitation learning, and shared experience between the real environment and the digital twin, coordinated by a policy referee that selects which agent acts in reality based on digital-twin performance and a real-to-virtual consistency ratio. The framework is evaluated for the longitudinal control of a flapping-wing drone, modelled as a nonlinear time-varying system driven by quasi-steady aerodynamic forces. The hybrid strategy is tested under three adaptive-model initializations: (1) offline identification from existing data, (2) random initialization with fully online identification, and (3) offline pre-training with biased parameters followed by online adaptation. In all cases, the hybrid framework improves performance, robustness, and sample efficiency over purely model-free and purely model-based approaches.
PiDR: Physics-Informed Inertial Dead Reckoning for Autonomous Platforms
A fundamental requirement for full autonomy is the ability to sustain accurate navigation in the absence of external data, such as GNSS signals or visual information. In these challenging environments, the platform must rely exclusively on inertial sensors, leading to pure inertial navigation. However, the inherent noise and other error terms of the inertial sensors in such real-world scenarios will cause the navigation solution to drift over time. Although conventional deep-learning models have emerged as a possible approach to inertial navigation, they are inherently black-box in nature. Furthermore, they struggle to learn effectively with limited supervised sensor data and often fail to preserve physical principles. To address these limitations, we propose PiDR, a physics-informed inertial dead-reckoning framework for autonomous platforms in situations of pure inertial navigation. PiDR offers transparency by explicitly integrating inertial navigation principles into the network training process through the physics-informed residual component. PiDR plays a crucial role in mitigating abrupt trajectory deviations even under limited or sparse supervision. We evaluated PiDR on real-world datasets collected by a mobile robot and an autonomous underwater vehicle. We obtained more than 29% positioning improvement in both datasets, demonstrating the ability of PiDR to generalize different platforms operating in various environments and dynamics. Thus, PiDR offers a robust, lightweight, yet effective architecture and can be deployed on resource-constrained platforms, enabling real-time pure inertial navigation in adverse scenarios.
comment: 11 pages and 7 figures
Class-Incremental Motion Forecasting
Motion forecasting enables autonomous vehicles to anticipate scene evolution by predicting the future trajectories of dynamic agents. However, existing approaches typically assume a closed-world setting with a fixed object taxonomy and access to high-quality perception, limiting their applicability in the real world where perception is imperfect, and new object classes may emerge over time. In this work, we introduce class-incremental motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are predicted directly from camera images. We propose the first end-to-end framework for this setting, which adapts to newly introduced classes while mitigating catastrophic forgetting of previously learned ones. Our method generates motion forecasting pseudo-labels for known classes and matches them with 2D instance masks from an open-vocabulary segmentation model. This 3D-to-2D keypoint voting mechanism filters inconsistent and overconfident predictions, while a query feature variance-based replay strategy samples informative past sequences to preserve prior knowledge. Extensive evaluations on nuScenes and Argoverse 2 show that our approach successfully preserves performance on known classes while effectively adapting to novel ones. We further demonstrate zero-shot transfer to real-world driving and show that the framework extends naturally to open- and closed-loop end-to-end class-incremental planning on nuScenes and NeuroNCAP. Code and models will be made publicly available at https://omen.cs.uni-freiburg.de.
comment: V3: Change title. Add further experiments
Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking
Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment.Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots. More results and videos are available on our project page: https://any2any.top/.
comment: Project Page: https://any2any.top/
An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts
Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High fidelity onboard models and black box data driven approaches can recover accuracy across the flight envelope, but respectively impose computational burdens prohibitive for real time allocation and sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.
DIFF-IPPO: Diffusion-Based Informative Path Planning with Open-Vocabulary Belief Maps
Exploration and object search require robots to perceive their environment, identify regions of interest, and plan trajectories that improve target-detection likelihood or maximize information gain. Many IPP methods, especially in continuous environmental monitoring, rely on Gaussian-process belief models, while object-search settings often produce complex, multimodal belief maps from semantic or open-vocabulary perception. Global trajectory generation directly conditioned on such non-Gaussian belief maps remains comparatively underexplored. Although diffusion-based planners offer strong capabilities for modeling such distributions, their use in informative path planning remains limited. In this work, we propose DIFF-IPPO, a pipeline that integrates an open-vocabulary belief map generator with a diffusion-based planner for global trajectory generation over belief maps. The method generates trajectories that concentrate sensor coverage over high-belief regions, achieving normalized detection scores between 81.49% and 86.55% across different dataset scenarios. We validate the system in a simulated search-and-rescue scenario where the planner searches candidate building regions to locate a burning building. In this setting, a team of five drones using batched belief-map-conditioned trajectory generation achieves first detections in 3.5 minutes.
GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking
This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2
comment: The content of this paper was included in the full manuscript of GenTrack family which has been submitted to the journal for possible publication
Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting ICML 2026
While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
comment: ICML 2026. Project page: https://vap-project.github.io/
GenTrack: A New Generation of Multi-Object Tracking
This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: first-a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics, second-leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors, third-integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions, fourth-a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates, and five-the first ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Simple, Strengthen, and Super, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
comment: This work has been submitted to the IEEE for possible publication
UniMM: A Unified Mixture Model Framework for Multi-Agent Simulation
Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we formulate a unified mixture model (UniMM) framework for generating multimodal agent behaviors, which can cover the mainstream methods including regression-based mixture models and discrete NTP models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the UniMM framework, we recognize critical configurations from both the model and data perspectives. We conduct a systematic examination of various model configurations, and comprehensively characterize their effects. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further introduce a temporal disentanglement-and-alignment mechanism to address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.
comment: Accepted author manuscript. The version of record has been published in IEEE Transactions on Pattern Analysis and Machine Intelligence
BIM Informed Visual SLAM for Construction Environments
Monitoring building construction sites requires comparing the as-planned design with the as-built state, which can be estimated in real time using Simultaneous Localization and Mapping (SLAM) techniques. However, visual SLAM is prone to trajectory drift in construction environments, producing maps that are geometrically inaccurate with the actual environment. To address this limitation, we augment an existing RGB-D SLAM system with structural priors derived from the Building Information Model (BIM). The system associates detected walls with their BIM counterparts and includes these correspondences as geometric constraints in the back-end optimization, reducing drift and enhancing global consistency. The proposed method operates in real time and is validated on multiple real construction sites, achieving an average trajectory error reduction of 25.23% and a 7.14% improvement in map accuracy over state-of-the-art baselines. Robustness analyses further demonstrate resilience to incomplete BIM data and geometric discrepancies between as-planned models and the as-built environment.
comment: 9 pages, 7 tables, 4 figures
RoboSSM: Scalable In-context Imitation Learning via State-Space Models IROS 2026
In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. Through diverse experiments on the LIBERO benchmark, we demonstrate the effectiveness of applying SSMs to ICIL, achieving improved generalization to both unseen and long-horizon tasks than Transformer-based ICIL methods by handling longer contexts at test-time. These results show for the first time that SSMs are an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.
comment: IROS 2026
AION: Aerial Indoor Object-Goal Navigation Using Dual-Policy Reinforcement Learning IROS 2026
Object-Goal Navigation (ObjectNav) requires an agent to autonomously explore an unknown environment and navigate toward target objects specified by a semantic label. While prior work has primarily studied zero-shot ObjectNav under 2D locomotion, extending it to aerial platforms with 3D locomotion capability remains underexplored. Aerial robots offer superior maneuverability and search efficiency, but they also introduce new challenges in spatial perception, dynamic control, and safety assurance. In this paper, we propose AION for vision-based aerial ObjectNav without relying on external localization or global maps. AION is an end-to-end dual-policy reinforcement learning (RL) framework that decouples exploration and goal-reaching behaviors into two specialized policies. We evaluate AION on the AI2-THOR benchmark and further assess its real-time performance in IsaacSim using high-fidelity drone models. Experimental results show that AION achieves superior performance across comprehensive evaluation metrics in exploration, navigation efficiency, and safety. The video can be found at \url{https://youtu.be/TgsUm6bb7zg}, code and model checkpoints are available at \url{https://github.com/Zichen-Yan/AION}.
comment: Accepted to IROS 2026
Robust Convex Model Predictive Control with collision avoidance guarantees for robot manipulators
Industrial manipulators typically operate in cluttered environments, where safe motion planning is critical. However, model uncertainties further complicate this task, which leads to conservative speed limits to reduce the influence of disturbances. Hence, there is a need for control methods that can guarantee safe motions which are executed fast. We address this by suggesting a novel model predictive control (MPC) solution for manipulators, where our two main components are a robust tube MPC and a corridor planning algorithm to obtain collision-free motion. Our solution results in a convex MPC formulation, which we can solve fast, making our method practically useful. We demonstrate the efficacy of our method in a simulated environment with a 6 DOF industrial robot operating in cluttered environments with uncertain model parameters. We outperform benchmark methods by tolerating higher levels of model uncertainty while achieving faster motion.
Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation
Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5\%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58\% to 72\% on long-horizon tasks.
Stiffness Optimization for Concentrated Bending in Magnetically Actuated Catheters: Maintaining Steerability under Gradient Stiffness
Achieving both efficient pushability (propulsion transmission) and proximally concentrated bending for steerability is challenging for magnetically actuated soft catheters: higher axial/bending stiffness improves force transmission but reduces steerability, whereas lower stiffness enables large, proximally concentrated bending yet increases kinking/buckling risk under compressive push loads. To address this trade-off, we propose a stiffness-optimized multi-segment magnetically actuated catheter (SO-MAC) that integrates a decoupled steering-advancement mechanism with a gradient-stiffness architecture. The SO-MAC concentrates bending about a stable proximal pivot during advancement while the distal section passively self-straightens to transmit propulsion, aided by the optimized stiffness distribution and elastic recovery of the spring backbone against friction-induced kinking/buckling. Over $0{-}180^{\circ}$ combined steering and advancement, the pivot remained stable and the distal tip advanced near-straight toward the target direction. A 1.5 mm-diameter SO-MAC achieved up to $180^{\circ}$ steering with a 3 mm bending radius at its 10 mm tip, with an average shape error of $1.39 \pm 0.56$ mm and a steering-pivot error of $0.35 \pm 0.10$ mm. Visual feedback control in a bronchial phantom further confirmed robust navigation through highly curved, bifurcating paths.
Bistable Quad-Nets Composed of Four-Bar Linkages
We study a novel type of mechanical structures, composed of spatial four-bar linkages, that are bistable, that is, they allow for two distinct configurations. These structures have an interpretation as quad nets in the Study quadric which we use to prove existence of assemblies with an unbounded number of links and joints. We propose a purely geometric construction of such objects, starting from infinitesimally flexible quad nets in Euclidean space and applying Whiteley de-averaging. This point of view situates the problem within the broader framework of discrete differential geometry and enables the construction of bistable structures from well-known classes of quad nets, such as discrete minimal surfaces. In contrast to many other construction methods for bistable structures, our approach does not rely on numerical optimization and it allows for simple control of relevant geometric parameters such as axis positions and snap angles.
Deep Learning-Based Lunar Crater Terrain Relative Navigation
Accurate position estimation is crucial for the successful implementation of future lunar landings using autonomous vehicles, especially in dangerous environments with sparse terrain features. In this paper, we propose a terrain relative navigation (TRN) algorithm combining our deep-learning crater detector, which was designed specifically for the NASA Crater Detection Challenge problem, and an Extended Kalman Filter (EKF). Our detector analyzes crater features from the monocular images acquired from orbit, and their matches with craters from a global database are identified via a Hungarian assignment approach followed by the consensus-based outliers removal method. The estimated measurements are then used to refine an EKF, where spacecraft pose estimation in the Lunar-Centered Lunar-Fixed (LCLF) frame of reference, augmented with altitude aiding information, constrains radial drift. The simulation results indicate that even if the spacecraft is off from its actual location up to 5 km, TRN could recover from this situation, achieving navigation error reduction to a few hundred meters. It should be noted that in order to maintain crater feature correspondences, it is important to match the image resolution and the scales within the scene to the detector training set distribution.
Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise ICRA
Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority samples comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: (1) Extreme class imbalance where delayed/false triggers are overwhelmed by true triggers; (2) Asymmetric label noise where mislabeled majority samples (true triggers) suppress minority samples (delayed/false triggers) learning. To overcome these challenges, we propose two key innovations: (1) Specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; (2) noise suppression using stable hardness estimation and probe-guided adaptive threshold to clean mislabeled true trigger samples. Crucially, we deploy our model as a practical annotation system with full-stack architecture, efficiently identifying critical delayed/false triggers from thousands of daily AEB events. Production results demonstrate 80% improvement in recall of delayed/false triggers and 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization
comment: 8 pages, 5 figures, accepted by IEEE International Conference on Robotics and Automation (ICRA)
A High-accuracy Event-based Underwater SLAM System
While event cameras offer immense potential for underwater SLAM, existing Time Surface (TS)-based methods prove highly unreliable when deployed underwater. Fluctuating camera velocities severely degrade TS imaging quality, while wide stereo baselines and repetitive underwater textures induce critical matching failures, frequently triggering system failure. To overcome these challenges, we develop the first high-accuracy event-based underwater stereo SLAM system. A structure-aware metric for TS is designed based on structure tensor coherence and gradients to quantitatively evaluate TS structural information density. By decoupling the optimal TS generation into two distinct stages based on system initialization, Bayesian Optimization(BO) first predicts an optimal prior TS sequentially before initialization while we set an asynchronous online local searching method periodically to obtain appropriate TS in real-time during the tracking stage. We use the prior disparity to guarantee precise data association and "latest-observation-first'' triangulation mechanism to realize stable triangulation. As a benchmark for these solutions and a resource for the community, we also contribute UWE, the first high-quality real-world underwater event dataset containing variable camera motions, complex textures and different trajectory features. Extensive evaluations on public datasets and UWE show the competitive accuracy performance of the proposed SLAM system compared to the state-of-the-art event-based method. The code and data will be open-sourced.
Learning Category-level Last-meter Navigation from RGB Demonstrations of a Single-instance
Achieving precise positioning of the mobile manipulator's base is essential for successful manipulation actions that follow. Most of the RGB-based navigation systems only guarantee coarse, meter-level accuracy, making them less suitable for the precise positioning phase of mobile manipulation. This gap prevents manipulation policies from operating within the distribution of their training demonstrations, resulting in frequent execution failures. We address this gap by introducing an object-centric imitation learning framework for last-meter navigation, enabling a quadruped mobile manipulator robot to achieve manipulation-ready positioning using only RGB observations from its onboard cameras. Our method conditions the navigation policy on three inputs: goal images, multi-view RGB observations from the onboard cameras, and a text prompt specifying the target object. A language-driven segmentation module and a spatial score-matrix decoder then supply explicit object grounding and relative pose reasoning. Using real-world data from a single object instance within a category, the system generalizes to unseen object instances across diverse environments with challenging lighting and background conditions. To comprehensively evaluate this, we introduce two metrics: an edge-alignment metric, which uses ground truth orientation, and an object-alignment metric, which evaluates how well the robot visually faces the target. Under these metrics, our policy achieves 74.58% success in edge-alignment and 89.42% success in object-alignment when positioning relative to unseen target objects. These results show that precise last-meter navigation can be achieved at a category-level without depth, LiDAR, or map priors, enabling a scalable pathway toward unified mobile manipulation. Project page: https://rpm-lab-umn.github.io/category-level-last-meter-nav/
Model-Reference Adaptive Flight Control of a 95-mg Insect-Scale Flapping-Wing Aerial Robot
Due to the system's scale and complex fabrication, the model describing the dynamics of a flapping-wing insect-scale aerial robot is subject to parameter uncertainty; for example, in the inertia matrix and the actuator mapping of the flier. Furthermore, due to its low inertia, this type of robot is greatly affected by stochastic and systematic disturbances during flight, including power-wire tension, gusts, and undesired aerodynamic forces produced by wing misalignment. Therefore, the high-performance execution of complex maneuvers at the subdecigram scale requires the robot to adapt its behavior to counteract disturbances and model uncertainty. Toward this objective, we introduce a model-reference adaptive control (MRAC) architecture for high-performance position control of flapping-wing robotic insects that can be modeled as rigid bodies in the three-dimensional (3D) space. In addition, we demonstrate how the implementation of a hybrid multiplicative extended Kálmán filter for estimating current and desired angular velocities during flight significantly dampens attitude vibrations, especially along the roll and pitch degrees of freedom (DOFs), and also improves flight performance. To show the suitability, functionality, and high performance of the proposed approach, we conducted real-time hovering and trajectory-tracking 6-DOF flight control experiments with a 95-mg insect-scale aerial robot.
comment: Under review, 8 pages, 7 figures
SIMSplat: Language-Aligned 4D Gaussian Splatting for Driving Scenario Generation
Driving scene manipulation using real-world sensor data has emerged as a promising alternative to traditional driving simulators. Despite advances in language control and neural scene representations, existing methods treat grounding, editing, and simulation as loosely connected stages, relying on heuristic object localization, manual guidance, and single-agent validation, thereby constraining semantic expressiveness and hindering scalable, reactive scenario generation. We introduce SIMSplat, a driving scene editor built on scene-graph-based 4D Gaussian Splatting augmented with language-aligned features. By embedding appearance, motion, and location semantics directly into Gaussian scene-graph nodes, SIMSplat makes reconstructed scenes queryable through free-form natural language, bridging language understanding to object-level editing and multi-agent simulation within a single framework. Building on this language-grounded scene graph, SIMSplat supports diverse edits including fine-grained pedestrian manipulation, while a multi-agent path refinement module propagates changes across all agents to ensure reactive, physically plausible simulations. The pipeline further integrates with Vision-Language Models for automated scenario mining. Experiments show that SIMSplat more than doubles baseline grounding accuracy, achieves the highest task completion rate, and produces the lowest failure rates across diverse driving scenarios.
Multiagent Systems
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems
When large language models serve as evaluators in multi-agent systems, their systematic evaluation biases propagate through the agent network. We introduce Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3-agent experiment using DeepSeek-chat with three distinct evaluator bias profiles (structured, balanced, evidence-based), we measure the Cross-Agent Contagion Matrix Gamma_3 and find that evaluator biases consistently propagate between agents (gamma in [0.157, 0.352]), even within the same underlying model. We identify three propagation regimes governed by the spectral radius rho(Gamma_N), and demonstrate that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model coefficients observed in prior work (MM-EPC: gamma approx 0.85-1.3), placing them in the suppression regime. We show that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, providing an actionable mitigation strategy. We release the open-source Contagion Network experimental framework.
comment: 20 pages, 4 figures, 4 tables
An Infrastructure-less, Control-Independent Solution to Relative Localisation of a Team of Mobile Robots using Ranging Measurements
The ability to localise teams of robots is essential for applications ranging from robotic fleets in unstructured environments to cooperative control and navigation tasks. In such contexts, fixed infrastructure is often unavailable, deployments must be fast and flexible, and system requirements must be minimal. We present a decentralised cooperative localisation algorithm that addresses all these challenges at once. The method is anchor-less, fully decentralised, and, unlike most existing approaches, does not require controlling the robots motion to ensure team observability. It relies only on local odometry, sparse inter-agent ranging measurements, and short-range communication, all of which are widely available in practice. The algorithm adopts a multi-hypothesis Bayesian framework that maintains the entire set of feasible solutions, ensuring robustness under transient unobservable conditions. Moreover, through information sharing, each agent benefits from the estimates of the entire group, even in partially connected conditions.
Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs
We present Phoenix, a multi-agent LLM system that resolves GitHub issues from triage through pull-request creation, combining seven layered safety controls with a baseline-aware test evaluation strategy. Phoenix decomposes the work across six specialized agents. Planner, reproducer, coder, tester, failure analyst and Pull Request (PR) agent, all coordinated by a label-based GitHub webhook state machine. Every change is checked against a baseline test run before a pull request is opened. On a 24-instance slice of SWE-bench Lite. run on the production webhook path, Phoenix oracle-resolves 75% of instances with no pass-to-pass regressions on successful runs; this curated slice is not directly comparable to full-split leaderboard results, and we discuss the limits of the comparison. A complementary pilot on 42 real issues across 14 repositories yields 100% correctness preservation (CP; mean 122s on the hard tier). Manual inspection shows that about half of the resulting pull requests are well-targeted fixes. The other half place code at incorrect paths, a planner localization limitation we are addressing with retrieval. We also report the deployment failure modes (WAF filtering, token expiry, permission boundaries, flaky CI) that motivated each safety mechanism.
A Multi-Agent system for Multi-Objective constrained optimization AAMAS 2026
Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a !rst step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.
comment: Presented at the 17th Workshop on Optimization and Learning in Multiagent Systems (OptLearnMAS, https://optlearnmas.github.io), co-located with the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning
This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer's internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisions. The experiment uses vehicle routing as a testbed, but the contribution is not a new routing solver, a particular ALNS configuration or a specific set of routing rules. The contribution is the RACL method: a way for a reasoning agent to discover, validate, consolidate and explain algorithmic control rules for a metaheuristic. In the current experimental setting, RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases, with an average RACL vs STP cost delta of -0.641%. In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without showing material computational overhead. During the proof-of-concept, Codex was used as an in-the-loop reasoning agent observing executions, interpreting logs and proposing live bounded interventions. The policy proxy was later used only to make quantitative evaluation reproducible.
comment: 10 pages, 5 tables
ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research
Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.
comment: 9 pages, 6 figures
Blame is easier than praise: Measuring off-ball defensive performance in football
The defensive performance of football players is commonly measured through a limited number of actions like tackles and interceptions while their continuous impact through positional behaviour has hardly been studied before. We formulate this problem as an attribution over multi-agent spatiotemporal trajectories without player-level ground truth labels, where event-level changes of expected threat are distributed among individuals. We propose a framework that performs this attribution using player involvement scores calculated from defensive pressure areas (DPAs). By computing role-conditioned baselines within automatically detected team structures, we can determine each defender's expected responsibility for threat created through arbitrary passes. The validity and robustness of this approach are evaluated on a uniquely extensive cross-gender and cross-competition data set, including positional and event data from 64 matches of the men's World Cup, 116 matches of the women's German Bundesliga and 336 matches of the men's German 3. Liga. In the absence of a ground truth, we propose an evaluation protocol that combines multiple relatively weak proxies into robust summary scores. We find a validity score that is improved by around 1 standard deviation compared to the best action-based metric and demonstrate that many popular measures show limited validity. The "blame" for conceding high-value actions shows especially strong correlations with external ratings and market values, making it the first published metric in football to reliably measure positioning errors. All code underlying this work is publicly available to support reproducibility and further research.
Deep-Unfolded Coordination
Distributed optimization is a highly scalable and structurally transparent technique to solve multi-agent robotics problems; however, such methods often suffer from the need for highly-specialized, problem-specific hyperparameter tunings. In this work, we propose Deep Coordinator, a deep-unfolding framework that learns to dynamically adjust the hyperparameters of ADMM-DDP, a popular distributed solver for robotics tasks, at solve-time in response to optimizer performance. Our architecture consists of unrolling a fixed number of ADMM-DDP iterations into a neural network with learnable functions between layers mapping the optimizer state to the next hyperparameters. To the best of our knowledge, Deep Coordinator is the first deep-unfolding framework to adapt the penalty parameters of a non-convex optimizer at solve-time; we show that the mainstream supervised approach can yield degenerate solutions when training such models, and propose an unsupervised learning scheme. On simulations with fleets of cars and quadrotors, Deep Coordinator produces trajectories of comparable quality 6.18-9.44x faster than conventional solvers. Furthermore, Deep Coordinator retains its performance benefits when deployed to systems up to 8x larger than trained on.
comment: The second and third authors contributed equally (equal second authorship). 35 pages (10 pages main text), 17 figures, 3 tables
Semiglobal Input-Delay Tolerance Algorithm for Distributed Nonconvex Optimization of Networked Nonlinear Systems
This paper studies a class of distributed optimization problems in networked nonlinear systems (NNSs) subject to input delays and consensus constraints. It introduces input-delay tolerant semiglobal convergence (IDTSC), meaning that for any prescribed compact initial set there exists an admissible delay bound under which the optimal solution is computed within consensus constraints and all node states converge to the solution. Building on a hierarchical design and input-to-state stability analysis, a new semiglobal input-delay tolerant (SIDT) algorithm is developed that practically achieves IDTSC for distributed optimization under the coupling between input delays and nonlinear dynamics. Further, by relaxing strict convexity requirements through the Polyak-Łojasiewicz condition, the SIDT algorithm broadens its applicability to nonconvex optimization. Finally, numerical experiments corroborate the theory on NNSs with input delays.
comment: 36 pages, 5 figures
Heterogeneous LLM Debate Under Adversarial Peers: Honest Gains, Replacement Costs, and Resilience
Heterogeneous LLM debate is motivated by the promise that diverse peers correct one another, but the same exchange that carries correction also carries adversarial influence. We measure which dominates by tracking how a heterogeneous peer changes the honest agents' revision behavior: how often they change their answer, and whether the change is corrective or harmful. We compare matched panels (homogeneous baseline, honest-mixed, and adversarial-mixed) and contaminated panels in which a malicious same-family peer is already present, spanning four model families and three reasoning benchmarks. An honest heterogeneous peer sharply lowers harmful revision, and an adversarial one reverses it. For Llama-3.1-70B defenders on MATH-hard, the honest-slot harmful-revision rate falls from 89% in the homogeneous panel to 35% with an honest peer, and an adversarial peer returns it to 90%. The conditional rate hides this damage on weak defenders, but the end-of-debate flip rate exposes it. The pattern keeps its sign across families and benchmarks while its magnitude varies with the defender-benchmark regime. We also measure the effects when an adversarial same-family peer is already present: an honest heterogeneous peer lowers both harmful revision and the rate at which initially-correct answers are lost. On the same Llama-3.1-70B setting, the added honest peer cuts the flip rate on initially-correct items from 31% under a same-family adversary to 6%. Heterogeneity is therefore not only an attack surface but, when an adversary is already present, also a defense.
SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design EMNLP2026
Existing graph-based multi-agent system (MAS) designers mainly improve collaboration by optimizing communication topologies over predefined agents, roles, or groups. However, because each node remains a closed-set entity, these methods struggle to generalize to tasks that require unseen combinations of capabilities. We propose SIGMA, a skill-incidence graph framework that constructs agents as task-conditioned bundles of reusable skills. Given a task and a skill library, SIGMA predicts a skill-agent incidence matrix, composes agent node embeddings from selected skills, and decodes a communication topology over the constructed agents. During execution, skill-specific mailboxes route messages to the relevant assigned capabilities, making the incidence structure directly operational. Across six reasoning and coding benchmarks with three base LLMs, SIGMA achieves the best average performance and improves over CARD, the strongest non-compositional topology-based baseline, by 2.06, 2.36, and 1.75 points, respectively. It also shows stronger robustness to unseen skill libraries, with an average performance drop of only 0.96 points. These results suggest that compositional node construction is a complementary and important axis for multi-agent design beyond communication topology optimization. Code is available at https://anonymous.4open.science/r/SIGMA-2338/.
comment: EMNLP2026
Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware
Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.
comment: 20 pages, 10 figures
Exit-and-Join Dynamics for Decentralized Coalition Formation
This paper studies coalition formation as a decentralized dynamical process driven by unilateral exit-and-join decisions. Agents evaluate local moves using the Aumann-Dreze value, so payoffs are computed within the agent's current coalition rather than through a globally negotiated coalition structure. The resulting model links cooperative payoff allocation with noncooperative best-response behavior: a terminal partition is precisely a coalition structure with no admissible, individually profitable exit-and-join deviation. We establish equilibrium characterizations, identify conditions under which the dynamics admit scalar Lyapunov or exact-potential representations, and analyze how switching and acceptance costs shape local stability. Numerical experiments test finite-time stabilization, cost sensitivity, and a special convex-game benchmark.
Heterogeneous Policy Networks for Composite Robot Team Communication and Coordination
High-performing human-human teams learn intelligent and efficient communication and coordination strategies to maximize their joint utility. These teams implicitly understand the different roles of heterogeneous team members and adapt their communication protocols accordingly. Multi-Agent Reinforcement Learning (MARL) has attempted to develop computational methods for synthesizing such joint coordination-communication strategies, but emulating heterogeneous communication patterns across agents with different state, action, and observation spaces has remained a challenge. Without properly modeling agent heterogeneity, as in prior MARL work that leverages homogeneous graph networks, communication becomes less helpful and can even deteriorate the team's performance. In the past, we proposed Heterogeneous Policy Networks (HetNet) to learn efficient and diverse communication models for coordinating cooperative heterogeneous teams. In this extended work, we extend Heterogeneous Policy Networks (HetNet) to support scaling heterogeneous robot teams. Building on heterogeneous graph-attention networks, we show that HetNet not only facilitates learning heterogeneous collaborative policies but also enables end-to-end training for learning highly efficient binarized messaging. Our empirical evaluation shows that HetNet sets a new state of the art in learning coordination and communication strategies for heterogeneous multi-agent teams by achieving an 5.84% to 707.65% performance improvement over the next-best baseline across multiple domains while simultaneously achieving a 200x reduction in the required communication bandwidth.
comment: IEEE Transactions on Robotics (T-RO)
Artificial collectives of specialists and generalists excel at different tasks
Collective artificial intelligence, where multiple agents work on shared tasks, holds potential to solve expansive problems in fields from medicine to collective governance. But while prescriptive engineering solutions abound, we lack descriptive scientific understanding of artificial collectives, and therefore principles for how to design resource efficient multi-agent systems. Through systematic experiments with optimizing agents, we characterize how agent interpretive abilities, rationality bounds, and task qualities interact to shape collective performance. Agents range from specialists, with narrow interpretive abilities, to generalists, with broad ones. Collectives of specialists correspond to sparse, centralized networks, while collectives of generalists correspond to dense, decentralized ones. We show that interpretive network properties have small performance effects on average (0.07 standard deviations of performance). However, for specific task qualities, these effects are 4.5 times larger (0.33 sd) and can reach much higher for certain task qualities (1.84 sd). This leads collectives of generalists to perform better on tasks that involve generating, choosing, and coordinating, while collectives of specialists with a few generalist mediators perform better on tasks that involve negotiating. Rationality bounds then moderate these relationships. At loose bounds, specialists outperform generalists through more effective sampling of high-dimensional decision spaces. At tight bounds, generalists outperform specialists through better gradient estimation. A fundamental trade-off between performance and convergence speed emerges at moderate bounds. These findings suggest that multi-agent design could benefit from matching interpretive networks to both task demands and agents' computational limits, with implications for the efficiency and energy costs of multi-agent systems.
comment: 10 pages, 4 figures
Process-Reward Tactic Evolution for Long-Horizon Bioinformatics Workflows
LLM agents can write code and call tools, but reliable bioinformatics work requires long-horizon interaction with workflow software, typed data objects, provenance, and biological checks. We study this setting through Galaxy workflow execution. The agent must explore task data, construct or adapt an executable workflow DAG, bind inputs and dataset collections, monitor execution, debug failures, and validate biological outputs. We propose Process-Reward Tactic Evolution, a Galaxy-based training framework that turns verified workflow rollouts into reusable \tactics. During training, agents practice on curriculum-organized Galaxy tasks in Agent Gym; process verifiers score workflow construction, software interaction, execution, and biological correctness; successful and failed traces are distilled into a tactic library. At inference, the trained executor, Process-Reward Tactic Evolution, uses this library to execute held-out peer reviewed Galaxy workflow converted BioWorkflow Bench and BioAgent Bench tasks in isolated environments. The paper evaluates whether process-supervised tactic accumulation improves long-horizon bioinformatics workflow completion, biological correctness, and execution efficiency over no-memory and reflection-style baselines.
Integrating Large Language Model Agents with Digital Twins for Industrial Autonomous Systems
Industrial automation is being transformed by digitalization and the increasing use of cyber-physical systems. Modern production environments require greater adaptability, faster reconfiguration, and more intuitive human-machine interaction. However, traditional rule-based systems rely on fixed logic and cannot autonomously adapt to changing conditions. Consequently, current automation systems lack a systematic approach for integrating adaptive and generalizable reasoning capabilities for interpreting, planning, and executing user tasks across dynamic environments and heterogeneous components. This dissertation proposes a three-layer framework that integrates large language models (LLMs), digital twins, and automation systems into an autonomous system. Autonomy is defined as a design property assigned to system components and enabled through LLM-based reasoning to achieve adaptive, goal-oriented behavior. The Task-Process-Service-Resource (TPSR) model is introduced to transform user tasks into executable processes. Four LLM roles are identified: process orchestration, service matching, digital resource generation, and agent-as-a-service. Five peer-reviewed studies develop and refine these concepts using the design science research methodology. Case studies and prototypes demonstrate adaptive task planning, event-driven control, simulation-based parameterization, and digital model generation. Results show high task executability, command correctness, and content-generation accuracy while reducing manual effort. The framework enables the integration of LLM-based reasoning into industrial automation systems and improves adaptability and usability. Limitations include dependence on accurate digital representations, the computational demands of LLMs, and the need for human intervention in safety-critical situations.
comment: Doctoral Dissertation, University of Stuttgart. Doctoral Exam Video Recording: https://youtu.be/Mhd9LiV5TKE
Modeling U.S. Attitudes Toward China via an Event-Steered Multi-Agent Simulator
Understanding the dynamic evolution of opinions, such as U.S. public attitudes toward China, is essential for assessing geopolitical risks. However, existing LLM-based multiagent simulators predominantly rely on static rules and fixed datasets, limiting their ability to capture the dynamic, event-driven nature of macro-level opinion shifts in real-world settings. To address this limitation, we propose an Event-Steered Multi-Agent Simulator (ES-MAS), in which significant events and daily news continuously drive opinion evolution through dynamic interactions among agents. We first construct the China-U.S. Relation Evolution (CURE) dataset, covering 20 quarters from 2021 to 2025, including 258 major events and over 14,000 daily news articles, and providing a comprehensive temporal foundation for modeling opinion dynamics. Building upon the CURE dataset, we propose a Dual-Stream Data Integration Engine (DSDIE) that aligns simulations with historical timelines via macro-level events while enabling personalized information exposure based on individual agent profiles and contextual signals. Furthermore, we design a News-Driven Dynamic Interaction (NDDI) module, which adaptively groups agents with shared news interests into localized interaction contexts, facilitating bottom-up consensus formation while mitigating the risk of isolated information cocoons. Experimental results on the CURE dataset demonstrate that ES-MAS substantially outperforms existing simulators in reproducing real-world historical trends, offering a scalable and effective framework for modeling dynamic opinion evolution.
UniMM: A Unified Mixture Model Framework for Multi-Agent Simulation
Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we formulate a unified mixture model (UniMM) framework for generating multimodal agent behaviors, which can cover the mainstream methods including regression-based mixture models and discrete NTP models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the UniMM framework, we recognize critical configurations from both the model and data perspectives. We conduct a systematic examination of various model configurations, and comprehensively characterize their effects. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further introduce a temporal disentanglement-and-alignment mechanism to address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.
comment: Accepted author manuscript. The version of record has been published in IEEE Transactions on Pattern Analysis and Machine Intelligence
Systems and Control (EESS)
Topological Data Analysis for High-Dimensional Dynamic Process Monitoring
Real-time process monitoring requires methods that extract actionable information from high-dimensional time-series data. In this work, we present a new approach for process monitoring that combines tools of topological data analysis (TDA) and machine learning. In the proposed approach, we represent multivariate time-series data as manifolds and use topological descriptors to summarize the structure of such data; we then use a neural ordinary differential equation to learn the dynamic evolution of the topological structure of the system. Using real data from an industrial process, we show that this trajectory-based event detection approach is effective at detecting diverse types of events. We contrast this approach against reconstruction-based approaches such as principal component analysis and autoencoders and against a trajectory-based approach that uses Koopman autoencoders.
PowerAgentBench-Dyn: A Benchmark for Agentic AI in Power System Dynamic Studies
Large Language Model (LLM)-based agents are increasingly being used to automate multi-step engineering work flows by interacting with software tools, interpreting intermediate results, and autonomously planning subsequent actions. Power system dynamic studies represent a particularly promising yet largely unexplored application domain for these agents. Unlike static computational tasks, dynamic studies often require more time on model parameter calibration, engineering judgment, and decision making under constrained action spaces. This paper introduces PowerAgentBench-Dyn, a benchmark designed to evaluate Agentic AI systems on power system dynamic-analysis tasks. The benchmark targets problems that cannot be reduced to a single optimization or coding task, but instead require a type of reasoning, tool usage, and iterative experimentation routinely performed by experienced power system engineers. The proposed framework includes two initial benchmark tasks. The first, the Dynamic Model Quality Review Benchmark, evaluates agents' ability to validate and diagnose dynamic models based on model-quality compliance criteria specified by system operators. The second, the Dynamic Security Risk Screening Benchmark, assesses agents' capability to leverage semantic memory and a limited simulation budget to identify, rank, and analyze the most critical short-circuit contingencies from an unseen fault dataset, as well as propose and evaluate possible mitigation measures. For each task, we define the simulation environment, observation and action spaces, and evaluation metrics. The benchmark is reproducible in a metric-based sense: released cases and simulator settings define a deterministic evaluator, while stochastic agent behavior is assessed over repeated runs using success rates and other metrics. The benchmark supports the development of future Agentic AI for power system operation and planning.
Sparse add-on controller design: A Youla approach to system-level performance
The performance of high-tech systems is often dictated by a few performance objectives shared among the many closed-loop controlled subsystems operating in the machine, such as synchronization, coordination, and alignment, which necessitates control methods that explicitly address them to achieve optimal performance. The aim of this paper is to introduce a framework that improves system performance through system-level controllers designed to be implemented as add-ons to the existing subsystem control structure. The developed method parametrizes all stabilizing system-level add-on controllers using the Youla framework, enabling a convex formulation of the sparse $\mathcal{H}_2$ synthesis problem. The result is a sparse add-on controller that achieves the optimal trade-off between combined performance and interconnection complexity, as demonstrated through numerical simulations.
Data-Driven Control from Poisoned Data: Fundamental Limitations and Secure DeePC
We study a data-driven control problem in the presence of arbitrary data poisoning attacks. We assume that a subset of offline output data is stored in unprotected locations and may be poisoned by an adversary. We first establish fundamental limitations for data-driven control arising from such poisoned data: poisoning attacks are not detected/identified from the dataset alone; unprotected data are non-informative for controller design with worst-case guarantees; and hard constraints on unprotected outputs are not certifiable. Motivated by these limitations and the data-enabled predictive control (DeePC) technique, we propose Secure DeePC, a data-driven control algorithm that is resilient against poisoning attacks. It first runs output-truncated DeePC using only the protected dataset until the online input becomes persistently exciting. It then uses online measurements to reconstruct the partial offline dataset, and finally returns to full-output DeePC. Secure DeePC achieves MPC-equivalent performance in finite time almost surely under certain conditions. Simulation results illustrate the efficacy of the proposed framework against poisoning attacks.
Techno-Economic Analysis of Shared Mobile Storage for Demand Charge Reduction
This paper investigates the techno-economic viability of shared electric vehicle (EV) fleets for demand charge reduction under practical logistical and operational constraints. Unlike idealized models that overlook transit overheads, we propose a high-fidelity fleet management framework that explicitly accounts for the spatio-temporal coupling of energy consumption, labor costs for EV drivers, and battery degradation. We formulate the dispatch problem as a mixed-integer linear program (MILP) that jointly minimizes demand charges and total cost of ownership. To address the computational complexity arising from path-dependent constraints, we develop a marginal-value-based heuristic algorithm that achieves near-optimal performance with high computational efficiency. Using real-world data from San Francisco, our analysis reveals that a modest number of EVs can achieve significant demand charge savings, sufficient to recover the ownership and operational expenses. Our results also show how tariff structures, fleet size, and cost components influence overall profitability.
comment: 22 pages, 26 figures, journal
Contraction-based Neural Control for Cooperative Aerial Payload Transportation with Variable-length Cables
This paper presents a novel neural nonlinear control framework for a multi-drone slung payload system with variable-length cables and a rigid-body payload. The equations of motion are formulated into a decoupled structure, where the payload and cable length dynamics are governed by independent control channels, facilitating modularized controller design on reduced-order subsystems. A neural control contraction metric (CCM) controller and a neural feedback controller are jointly trained to enforce contraction conditions for the payload subsystem. Separately, a cable length control law is derived that exploits the variable-length degree of freedom for obstacle avoidance. Numerical simulations demonstrate trajectory tracking of a rigid-body payload and gate traversal capabilities of the overall system under the proposed control framework.
comment: Submitted for publication in AIAA Scitech 2027
Nodal Braess's Paradox and Inertia Destabilization with Dynamic Node and Line Failures in Power Grids
Large-scale power outages are typically caused by cascading failures. These unfold dynamically through complex interactions between network dynamics and individual component failures. In contrast, the study of cascading failures in physics has focused on analyzing line overloads in the quasi-static regime. We introduce a new model that integrates the dynamics of node and line failures with a paradigmatic oscillator model for power grid synchronization. This enables us to investigate the collective cascading behavior of coupled failures for the first time. We study the impact of nodal robustness, the ability of nodes to tolerate transient disturbances, and inertia, the ability of nodes to resist frequency deviations, on cascade sizes. We discover two novel mechanisms driving system fragility: i) While low inertia is widely considered a major challenge for power grids, we find that high inertia can amplify cascade sizes unless accompanied by appropriate adjustments of other dynamical properties. ii) Further, we find that an increase in the robustness of individual nodes can paradoxically lead to larger cascades. This latter phenomenon constitutes a novel type of Braess's paradox. Understanding such counterintuitive collective effects may become central for achieving resilient future power grids.
Semiglobal Input-Delay Tolerance Algorithm for Distributed Nonconvex Optimization of Networked Nonlinear Systems
This paper studies a class of distributed optimization problems in networked nonlinear systems (NNSs) subject to input delays and consensus constraints. It introduces input-delay tolerant semiglobal convergence (IDTSC), meaning that for any prescribed compact initial set there exists an admissible delay bound under which the optimal solution is computed within consensus constraints and all node states converge to the solution. Building on a hierarchical design and input-to-state stability analysis, a new semiglobal input-delay tolerant (SIDT) algorithm is developed that practically achieves IDTSC for distributed optimization under the coupling between input delays and nonlinear dynamics. Further, by relaxing strict convexity requirements through the Polyak-Łojasiewicz condition, the SIDT algorithm broadens its applicability to nonconvex optimization. Finally, numerical experiments corroborate the theory on NNSs with input delays.
comment: 36 pages, 5 figures
A Differentiable Composite Approximation Framework for Autonomous Underwater Vehicle Maneuvering Modeling from Sea-Trial Data
Field-based modeling from onboard measurements can produce autonomous underwater vehicle (AUV) maneuvering models that reflect real operating characteristics. From an approximation perspective, conventional maneuvering models use predefined constraint polynomial bases, whereas data-driven models use data-adaptive bases. Motivated by this basis-function view, this paper presents a differentiable composite-approximation formulation, in which the polynomial-basis component and the data-adaptive basis component are treated as differentiable parts of a single predictor and calibrated jointly. A gradient-based co-calibration method is developed for full-scale AUV maneuvering prediction, where a sensitivity-aware mechanism regulates bounded polynomial updates while the neural residual captures remaining nonlinear discrepancies under a shared prediction objective. To account for ocean-current effects in field data, a turning-motion-based current estimation and compensation procedure is incorporated to construct current-compensated learning targets for training and rollout. The framework is evaluated using sea-trial data collected from a 7-meter AUV under multiple maneuvering conditions. Results show that the proposed method improves recursive trajectory and velocity prediction compared with polynomial-only, neural-only, and frozen-prior hybrid baselines, demonstrating its applicability to field-data-based AUV maneuvering modeling.
Comparative Study on Agility, Efficiency, and Impact Absorption of Bipedal Robots with Active Toes
Human legs exhibit high efficiency, agility, and impact absorption, with toes playing a crucial role in these capabilities. While many attempts have been made to implement human-like toes in robots, they have not fully replicated human characteristics nor rigorously validated their benefits. We propose a 14-DOF biped robot emulating human toes' lightweight, high-torque, robust nature. To quantitatively analyze the effectiveness of the active toes in terms of agility, efficiency, and impact absorption, we developed a high-fidelity simulation training environment that reflects actual actuators with coupled transmissions and accurate power consumption. To ensure a fair comparison between configurations with and without active toes, we designed a minimal RL reward function and applied an identical training procedure to both. The simulation results indicate that, at 1.33 m/s walking, the toe-equipped robot reduced CoT by 17.5% and heel-strike GRF by 5.0% compared with the toe-ablation configuration. On the agility test, average and maximum path deviation decreased by 25.0% and 34.0%, respectively.
comment: 6 pages, 7 figures
A Unified Framework for Joint Sensor Placement and Scheduling for Intrusion Detection
We consider an intrusion detection task in which a defender must jointly optimize sensor placement locations and orientations to minimize the probability of missed detection of an intruder traversing a protected environment. We decompose this problem into a meta problem, termed SensorPlacement, and an embedded subproblem, termed OrientationScheduling. The OrientationScheduling subproblem, for a fixed sensor placement, is modeled as a 2-player zero-sum game between the defender and the intruder, where the defender seeks an orientation strategy for the deployed sensors to minimize the probability of missed detection, while the intruder seeks a path selection strategy to maximize it. Since the defender's strategy space grows combinatorially with the number of sensors and orientations, solving the game via standard linear programming becomes prohibitive. To this end, we develop an iterative and efficient equilibrium-seeking algorithm that exploits the structure of the game's payoff function and establishes theoretical guarantees for convergence to the Nash equilibrium (NE) of the game. This NE value is then used as a utility measure in the SensorPlacement meta problem. We show that this game-value-based utility function is weakly submodular over the set of sensor placements and propose a greedy placement algorithm with near-optimality guarantees. To our knowledge, this is the first unified framework to integrate game-theoretic utility design with (weak) submodular optimization, enabling principled joint optimization of sensor placement and orientation scheduling. Through extensive simulations, we demonstrate that the proposed approach achieves near-optimal detection performance while significantly reducing computation time compared to baselines.
comment: 27 pages, 4 figures
Exit-and-Join Dynamics for Decentralized Coalition Formation
This paper studies coalition formation as a decentralized dynamical process driven by unilateral exit-and-join decisions. Agents evaluate local moves using the Aumann-Dreze value, so payoffs are computed within the agent's current coalition rather than through a globally negotiated coalition structure. The resulting model links cooperative payoff allocation with noncooperative best-response behavior: a terminal partition is precisely a coalition structure with no admissible, individually profitable exit-and-join deviation. We establish equilibrium characterizations, identify conditions under which the dynamics admit scalar Lyapunov or exact-potential representations, and analyze how switching and acceptance costs shape local stability. Numerical experiments test finite-time stabilization, cost sensitivity, and a special convex-game benchmark.
Learning Neural Maximal Lyapunov Functions on $\mathsf{SO}(n)$
Establishing stability guarantees for dynamical systems on Lie groups is a fundamental challenge, as classical Lyapunov methods developed for Euclidean spaces do not directly transfer to curved geometries. In this paper, we propose a framework for learning maximal Lyapunov functions for systems evolving on the special orthogonal group $\mathsf{SO}(n)$. Theoretically, we introduce a neural Lyapunov architecture based on the logarithmic map with proven approximation capabilities, and we formulate the learning problem via a Zubov-type characterization of the maximal region of attraction. A key technical contribution is the derivation of explicit, numerically tractable formulas for the derivative of the logarithmic map, enabling training through a two-phase algorithm that balances computational efficiency and accuracy. Empirically, we validate the approach on a low-dimensional nonlinear system.
comment: Accepted to IEEE Control Systems Letters (L-CSS), 6 pages, 2 figures,
Formalizing Task-Space Complexity for Zero-Shot Generalization
Policies must operate across diverse conditions, yet a single policy is often conservative while fully adaptive schemes can be complex. We study zero-shot generalization in contextual dynamical systems and introduce a performance-centric, directional task dissimilarity--the signed divergence--that upper bounds the generalization gap from a source context to a target context. The signed divergence induces $\varepsilon$-tolerance sets that certify when a source policy class generalizes, and it yields a concrete notion of task-space complexity: the minimum number of source contexts needed so that every target context incurs at most $\varepsilon$ generalization gap. Under a mild local smoothness assumption on performance, the induced tolerance sets admit certified inner/outer balls and instance-dependent volume bounds on task-space complexity. In the finite-oracle setting, source selection reduces to set cover; a greedy strategy inherits the standard $H(n)$ approximation guarantee. Using a Mass-Spring-Damper system with linear-quadratic regulator (LQR) controllers and a nonlinear CartPole system with deep reinforcement learning controllers, we show that greedy selection achieves the same $\varepsilon$-coverage with fewer policies than uniform or random baselines. Our approach delivers a performance-based task similarity measure and practical certificates for building generalizable control with simple policies.
Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering
Executable evaluation -- checking the consequences of an agent's actions with a program rather than grading its prose -- has become a prominent way to assess tool-using AI agents in software settings. Electric power engineering has not yet had an analogous benchmark: language-model use is still dominated by retrieval and text question answering, while agents acting on power-system artifacts remain mostly academic prototypes. We introduce the Power Systems Agent Benchmark, an executable benchmark for power-engineering agents. An agent receives a structured task and returns a structured solution; a deterministic evaluator recomputes the engineering quantities, checks operational constraints, and returns a feasibility flag, a normalized score, and explicit violations. The benchmark contains 41 task families across eight areas of power engineering, from power flow and protection to stability, microgrids, reliability, power quality, and forecasting. Each task is grounded in a citable source, standard, or documented engineering formulation. To resist contamination, held-out cases are synthesized on demand by per-family generators from private seeds: the construction is inspectable, but the instances remain private. In a reference evaluation with three command-line agents, the strongest score near the compact tier's ceiling, a smaller open model trails, and public and held-out performance are broadly consistent; a separate public-split grid with OpenCode and Aider probes harness effects. The reference evaluation doubles as quality control: unanimous failures flag candidate task or evaluator defects, and it exposed a latent evaluator bug missed by self-consistency checks. The evaluators are compact deterministic surrogates, but the task contract allows their internals to be upgraded to simulator-backed checks without changing how tasks are posed or solved.
comment: 19 pages, 1 figure, 2 tables. Code and data: https://github.com/trashchenkov/power-systems-agent-benchmark ; archived at https://doi.org/10.5281/zenodo.20753046
Short-Term Electricity Demand Forecasting for New England Using a Hybrid Transformer-XGBoost Framework with Weather, Calendar, and COVID-19 Indicators
Accurate short-term electricity demand forecasting is critical for reliable power system operation, energy market planning, and infrastructure optimization. This paper presents a hybrid framework combining a Transformer encoder for temporal feature extraction with gradient-boosted decision trees (XGBoost) for daily electricity demand forecasting across New England. The framework integrates meteorological observations from six cities spanning all six New England states, calendar and holiday effects, autoregressive demand lags, and COVID-19 epidemiological variables. Hyperparameter optimization uses Optuna with a multivariate Tree-structured Parzen Estimator over 500 trials, with a leakage-free 70/15/15 chronological train-validation-test split. The hybrid model achieves a test RMSE of 8,876 MWh, MAPE of 2.05%, and R-squared of 0.906. A tabular-only XGBoost baseline achieves RMSE of 9,304 MWh, MAPE of 2.21%, and R-squared of 0.896. A Diebold-Mariano test (Harvey-Leybourne-Newbold correction) confirms the 427.7 MWh difference is statistically indistinguishable from noise (DM = -1.126, p = 0.262). An ablation study reveals COVID-19 features improved training accuracy but had asymmetric test effects: removal degraded hybrid RMSE by 3.2% while marginally improving XGBoost-only by 1.2%. A SHAP temporal analysis shows 5 of 8 COVID features rank higher on the post-acute test set than during pandemic-active training, indicating the model over-applies learned pandemic patterns. These findings establish temporal validity decay as a central mechanism: behavioral disruptions drove a strong COVID-demand signal during 2020-2021, but adaptation was complete by mid-2022, leaving epidemiological features as noise amplifying overfitting to stale pandemic patterns.
A Metaheuristic Framework for Optimized HAPS-Aided Localization in Urban Areas
High-altitude platform stations (HAPS), originally designed for communication services, can also provide structured signals of opportunity (SoOP) to augment the global navigation satellite system (GNSS). However, dense urban environments introduce severe blockage and non-line-of-sight (NLOS) conditions that undermine GNSS accuracy and render geometric placement metrics insufficient. To address this, we propose a metaheuristic framework for jointly optimizing the number and placement of HAPS under practical constraints by integrating high-fidelity 3D city models, ray-tracing, and multi-objective optimization to handle the discrete and highly non-convex design space. Three metaheuristic solutions based on distinct search principles are developed to efficiently explore the solution space, all demonstrating rapid convergence and consistently outperforming a greedy baseline, particularly in the low-to-moderate HAPS regime. For representative dense urban scenarios, we show that four HAPS are sufficient to satisfy an 18-m average 3D positioning error bound (PEB) threshold, while configurations with two to five HAPS achieve over 50\% reduction in mean and root mean square (RMS) PEB and up to 94\% and 87\% reduction in standard deviation and coefficient of variation (CV), respectively, compared to the satellite-only case. Diminishing returns are observed beyond six HAPS due to geometric redundancy, emphasizing the importance of optimized placement. The framework further demonstrates strong robustness and generalizability across diverse urban environments with varying building morphology and propagation conditions, establishing it as an effective and scalable solution for HAPS-assisted localization in realistic urban settings.
comment: 18 pages, 40 figures
Eigenspace-Based Clustering for Personalized System Identification
We study the problem of system identification in heterogeneous settings, where different systems may follow distinct underlying dynamics. Existing clustered system identification approaches often rely on iterative training-based cluster assignment, which can be sensitive to learning uncertainty and model initialization. In contrast, we propose a one-shot, training-free clustering method that identifies similar systems using the structure of their locally observed data. Specifically, each system estimates a local state covariance matrix, and cluster identities are inferred by measuring the alignment between the leading covariance eigenspaces of different systems. We provide a mathematical interpretation of the proposed similarity score and develop a finite-sample analysis that characterizes how covariance estimation error induces eigenspace perturbations in terms of the underlying system dynamics. We then derive a probability bound for pairwise false merges and a global clustering success guarantee. Numerical experiments demonstrate that the proposed eigenspace-based clustering method effectively identifies systems with shared dynamics, leading to lower personalized model-estimation error compared with training-based clustering and non-clustered baselines.
Integrating Large Language Model Agents with Digital Twins for Industrial Autonomous Systems
Industrial automation is being transformed by digitalization and the increasing use of cyber-physical systems. Modern production environments require greater adaptability, faster reconfiguration, and more intuitive human-machine interaction. However, traditional rule-based systems rely on fixed logic and cannot autonomously adapt to changing conditions. Consequently, current automation systems lack a systematic approach for integrating adaptive and generalizable reasoning capabilities for interpreting, planning, and executing user tasks across dynamic environments and heterogeneous components. This dissertation proposes a three-layer framework that integrates large language models (LLMs), digital twins, and automation systems into an autonomous system. Autonomy is defined as a design property assigned to system components and enabled through LLM-based reasoning to achieve adaptive, goal-oriented behavior. The Task-Process-Service-Resource (TPSR) model is introduced to transform user tasks into executable processes. Four LLM roles are identified: process orchestration, service matching, digital resource generation, and agent-as-a-service. Five peer-reviewed studies develop and refine these concepts using the design science research methodology. Case studies and prototypes demonstrate adaptive task planning, event-driven control, simulation-based parameterization, and digital model generation. Results show high task executability, command correctness, and content-generation accuracy while reducing manual effort. The framework enables the integration of LLM-based reasoning into industrial automation systems and improves adaptability and usability. Limitations include dependence on accurate digital representations, the computational demands of LLMs, and the need for human intervention in safety-critical situations.
comment: Doctoral Dissertation, University of Stuttgart. Doctoral Exam Video Recording: https://youtu.be/Mhd9LiV5TKE
Integrated Exploration-Aware UAV Route Optimization and Path Planning
Uncrewed aerial vehicles (UAVs) are increasingly used for exploration-driven monitoring in hazardous environments such as disaster zones, contaminated sites, wildfire areas, and damaged infrastructure, where limited flight endurance must be allocated between visiting reported locations and gathering new information. In these settings, prior information regarding hazards is often incomplete, spatially imprecise, and subject to change during execution. For example, initial reports may identify a region where a hazard is likely to exist, but the actual hazard may be displaced, partially observed, or entirely unreported. We present an integrated exploration-aware UAV route optimization and path planning framework for hazard monitoring under uncertain and evolving prior information. The environment is represented as a spatial risk map, where each location has an associated belief of hazardous conditions. Reported hazards are modeled as uncertain regions of interest (ROIs) rather than confirmed target locations, requiring the UAV to inspect reported areas while also using its limited flight endurance to explore informative regions. The proposed method solves a vehicle routing problem over reported ROIs, augments the route with auxiliary pseudo-nodes to improve spatial coverage, allocates the remaining flight distance budget across route segments, and optimizes dynamically feasible B-spline trajectories for local exploration. During execution, UAV measurements update a grid-based belief map, and the remaining trajectory is replanned when new information and the remaining budget justify adaptation. Across 48 scenario configurations, online replanning improves average KL reduction by 15.9% over the offline optimized planner and 48.6% over straight-line traversal.
A graph-informed regret metric for optimal distributed control
We consider the optimal control of large-scale systems using distributed controllers whose network topology mirrors the coupling graph between subsystems. In this work, we introduce spatial regret, a graph-informed metric measuring the worst-case performance gap between a distributed controller and an oracle with access to additional sensor information. The oracle's graph is a user-specified augmentation of the information graph, yielding a benchmark policy that penalizes disturbances for which additional sensing would improve performance. Minimizing spatial regret yields distributed controllers - respecting the nominal information graph - that emulate the oracle's response to disturbances characteristic of large-scale networks, such as localized perturbations. We show that minimizing spatial regret admits a convex reformulation as an infinite program with a finite-dimensional approximation. To scale to large networks, we derive an upper bound on the spatial regret that can be efficiently minimized in a distributed way. Numerical experiments on power-system models show that the resulting controllers mitigate localized disturbances more effectively than those based on classical metrics.
A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation CEC
Reliable state estimation in robotics and control re quires balancing estimation accuracy against computational cost. While filtering-based methods such as the Extended Kalman Filter (EKF) provide efficient real-time updates, and optimisation based formulations using factor graphs improve global consistency, the role of optimisation scheduling is often treated implicitly rather than examined as an explicit design variable. This paper presents an experimental study that explicitly isolates optimisation scheduling using a Smart Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. By combining EKF-based state propagation with periodically invoked batch optimisation and holding solver structure and effort fixed, the main contribution of this work is the experimental characterisation of optimisation scheduling as an independent design variable governing the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly influences pre optimisation drift, transient error behaviour, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimisation can be retained at a fraction of the computational cost, highlighting optimisation scheduling as an under-explored yet critical consideration in hybrid state estimation systems.
comment: This work has been accepted for presentation/publication at the 2026 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). The final published version will appear in IEEE Xplore
Immersive and Wearable Thermal Rendering for Augmented Reality
We present a proof-of-concept palm-mounted thermal feedback prototype addressing thermal rendering challenges specific to augmented reality (AR), where users must interact with both real and virtual objects in their physical workspace. In contrast to thermal feedback systems developed for virtual reality, AR thermal feedback must preserve manual dexterity, maintain access to real-world thermal cues, and provide coherent virtual temperature sensations without obstructing natural object interaction. We propose three AR-specific design considerations, which our prototype implements: indirect feedback to preserve fingertip dexterity, active thermal passthrough to sense and render the temperature of contacted physical surfaces, and spatially and temporally varying thermal rendering across the palm. Human-subject experiments evaluated perceptual sensitivity, indirect feedback, active thermal passthrough, spatial pattern recognition, and moving thermal rendering during AR interaction. Results showed that although indirect feedback reduced perceived realism during visual contact at the fingertips, it did not reduce immersion or comfort; active thermal passthrough supported temperature discrimination between real and rendered surfaces; and spatiotemporal rendering significantly improved immersion and realism compared with static thermal stimulation. These findings suggest that our design considerations are viable design strategies for AR thermal haptics, while also clarifying tradeoffs for applications that require precise realism versus broader immersive thermal experience.
A graph neural network surrogate model for mesh-based crashworthiness prediction of vehicle panel components
Crashworthiness is a key performance measure in the design of safety-critical vehicle panel components such as B-pillars. Finite element (FE) simulations are widely used to evaluate crash responses but remain computationally expensive for large-scale, nonlinear impact scenarios, particularly when integrated into iterative design and optimisation processes. Although machine learning-based surrogate models have been developed for rapid crashworthiness analysis, they exhibit limitations in detailed representation of complex 3-dimensional components. Graph Neural Networks (GNNs) have emerged as a promising solution for processing data with complex structures. However, existing GNN models often lack sufficient accuracy and computational efficiency to meet industrial demands. This paper proposes Recurrent Graph U-Net (ReGUNet), a graph-based surrogate model for crashworthiness analysis of vehicle panel components. By representing FE meshes in graph form, the model naturally accommodates complex irregular structural geometries. Its hierarchical architecture improves computational efficiency and accuracy, while the introduction of recurrence enhances stability of temporal predictions over multiple time steps. A side-impact case study of hot-stamped steel B-pillars with varying geometries is used to generate training dataset. The trained model demonstrates high accuracy in predicting the dynamic deformation behaviour and crashworthiness indicators of previously unseen component designs. ReGUNet achieves over a 52% reduction in the average deformation prediction error relative to baseline methods, together with markedly improved computational efficiency. ReGUNet provides rapid and reliable crashworthiness assessments, which in turn accelerates the design cycle of vehicle panel components.
comment: Accepted manuscript version. Final published version available in Results in Engineering via DOI: 10.1016/j.rineng.2026.110925
Distributed Resilient State Estimation and Control with Strategically Implemented Security Measures
This paper addresses the problem of distributed resilient state estimation and control for linear time-invariant systems in the presence of malicious false data injection sensor attacks and bounded noise. We consider a system operator (defender) capable of deploying cybersecurity measures to counteract the sensor compromises. Although such measures enhance resilience against adversarial attacks, they may incur substantial costs; hence, it is crucial to select countermeasures to balance resilience gains and cost efficiency strategically. We first demonstrate that the system's resilience against attacks is maximized through the appropriate implementation of security measures, implying that no attacker can execute undetectable sensor attacks. Building on this analysis, we propose an algorithm that identifies the optimal security measure. While determining this measure is NP-hard in general, we also derive sufficient conditions under which efficient computation is feasible. Furthermore, we develop a distributed resilient state estimation and control scheme informed by the optimal security measure and establish conditions that guarantee bounded estimation and control errors. Finally, we validate the efficacy of our approach via numerical simulations of a vehicle platooning scenario.
comment: Accepted for publication in the IEEE Transactions on Automatic Control
Auction-Based Responsibility Allocation for Scalable Decentralized Safety Filters in Cooperative Multi-Agent Collision Avoidance
This paper proposes a scalable decentralized safety filter for multi-agent systems based on high-order control barrier functions (HOCBFs) and auction-based responsibility allocation. While decentralized HOCBF formulations ensure pairwise safety under input bounds, they face feasibility and scalability challenges as the number of agents grows. Each agent must evaluate an increasing number of pairwise constraints, raising the risk of infeasibility and making it difficult to meet real-time requirements. To address this, we introduce an auction-based allocation scheme that distributes constraint enforcement asymmetrically among neighbors based on local control effort estimates. The resulting directed responsibility graph guarantees full safety coverage while reducing redundant constraints and per-agent computational load. Simulation results confirm safe and efficient coordination across a range of network sizes and interaction densities.
comment: 6 pages, 3 figures, accepted for presentation at the IFAC World Congress 2026
HBS -- Hardware Build System: Characterizing and comparing direct-Tcl and indirect-abstract approaches for hardware build systems
Build systems become an indispensable part of the software implementation and deployment process. New programming languages are released with the build system integrated into the language tools, for example, Go, Rust, or Zig. However, in the hardware description domain, no official build systems have been released with the predominant Hardware Description Languages (HDL) such as VHDL or SystemVerilog. Moreover, hardware design projects are often multilingual. The paper characterizes and compares two common approaches for hardware build system implementations. The first one, the direct-Tcl approach, in which the build system code is executed directly by the EDA tool during the design build flow. The second one, the indirect-abstract approach, in which the build system produces a Tcl script, which is later run by a proper EDA tool. As none of the existing direct-Tcl build systems was close to the indirect-abstract build systems in terms of supported functionalities, the paper also presents a new direct-Tcl hardware build system called HBS. The implemented build system was used as a representative of direct-Tcl build systems in comparison with indirect-abstract build systems.
Robust Safety Filters for Lipschitz-Bounded Adaptive Closed-Loop Systems with Structured Uncertainties
Adaptive control provides closed-loop stability and reference tracking for uncertain dynamical systems through online parameter adaptation. These properties alone, however, do not ensure safety in the sense of forward invariance of state constraints, particularly during transient phases of adaptation. Control barrier function (CBF)-based safety filters have been proposed to address this limitation, but existing approaches often rely on conservative constraint tightening or static safety margins within quadratic program formulations. This paper proposes a reference-based adaptive safety framework for systems with structured parametric uncertainty that explicitly accounts for transient plant-reference mismatch. Safety is enforced at the reference level using a barrier-function-based filter, while adaptive control drives the plant to track the safety-certified reference. By exploiting Lipschitz bounds on the closed-loop tracking error dynamics, a tracking-error-dependent robust CBF condition is derived and equivalently reformulated as a convex second-order cone program (SOCP). The proposed safety-filter formulation reduces conservatism relative to fixed-margin CBF formulations by rendering the resulting safety constraints progressively less restrictive as the plant-reference tracking error decreases, while preserving formal guarantees of forward invariance and closed-loop stability.
comment: 6 pages, 4 figures, accepted for publication in the IEEE Control Systems Letters (L-CSS)
Decentralized CBF-based Safety Filters for Collision Avoidance of Cooperative Missile Systems with Input Constraints
This paper presents a decentralized safety filter for collision avoidance in multi-agent aerospace interception scenarios. The approach leverages robust control barrier functions (RCBFs) to guarantee forward invariance of safe sets under bounded inputs and high-relative-degree dynamics. Each effector executes its nominal cooperative guidance command, while a local quadratic program (QP) modifies the input only when necessary. Event-triggered activation based on range and zero-effort miss (ZEM) criteria ensures scalability by restricting active constraints to relevant neighbors. To ensure feasibility under multiple simultaneously active constraints, a slack-variable relaxation scheme is introduced that prioritizes critical agents in a Pareto-optimal manner. Simulation results in many-on-many interception scenarios demonstrate that the proposed framework maintains collision-free operation with minimal deviation from nominal guidance, providing a computationally efficient and scalable solution for safety-critical multi-agent aerospace systems.
comment: 7 pages, 5 figures, accepted for presentation at the 2026 American Control Conference (ACC 2026)
Control Allocation Algorithm for Hypersonic Glide Vehicles with Input Limitations
Hypersonic glide vehicles (HGVs) operate in challenging flight regimes characterized by strong nonlinearities in actuation and stringent physical constraints. These include state-dependent actuator limitations, asymmetric control bounds, and thermal loads that vary with maneuvering conditions. This paper introduces an iterative control allocation method to address these challenges in real time. The proposed algorithm searches for control inputs that achieve the desired moment commands while respecting constraints on input magnitude and rate. For slender HGV configurations, thermal loads and drag generation are strongly correlated-lower drag typically results in reduced surface heating. By embedding drag-sensitive soft constraints, the method improves energy efficiency and implicitly reduces surface temperatures, lowering the vehicle's infrared signature. These features are particularly advantageous for long-range military operations that require low observability. The approach is demonstrated using the DLR's Generic Hypersonic Glide Vehicle 2 (GHGV-2) simulation model. The results confirm the method's effectiveness in maintaining control authority under realistic, constrained flight conditions.
comment: 43pages, 21 figures, accpeted for publication in the AIAA Journal of Guidance, Control, and Dynamics
An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts
Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High fidelity onboard models and black box data driven approaches can recover accuracy across the flight envelope, but respectively impose computational burdens prohibitive for real time allocation and sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.
Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks
This paper presents an autonomous agentic resource negotiation framework designed to enable zero-touch network slicing in 6G architectures using Large Language Model (LLM) agents. While LLMs offer powerful reasoning capabilities, we demonstrate that such agents inherently suffer from anchoring bias, rigidly adhering to initial heuristic proposals and causing severe network over-provisioning. To systematically mitigate this cognitive bias, we propose a novel randomized anchoring strategy modeled via a Truncated 3-Parameter Weibull distribution. This mathematically bounded approach seamlessly integrates with burst-aware Digital Twins (DTs) employing Conditional Value at Risk (CVaR) to rigorously guarantee strict Service Level Agreement (SLA) tail-latencies. To validate our methodology, we introduce and prove the \emph{Bimodal Constraint-Avoidance Utility Theorem}, demonstrating that while feasible negotiations follow classical convex bounds, highly constrained scenarios undergo a phase transition governed by an inverse rational decay envelope. Empirical results generated using a locally hosted 1B-parameter model otel-llm-1b-it confirm these dual-regime bounds. Our cognitive de-biasing successfully dismantles rigid negotiation patterns, forcing agents into active exploration to safely ride SLA boundaries and boost system energy savings up to 25\%. Crucially, the lightweight 1B LLM achieves sub-second inference latencies (0.95s mean), ensuring our multi-agent framework is compatible with the operational timescales of the O-RAN non-Real-Time RAN Intelligent Controller (non-RT RIC)\footnote{Our source code is available for non-commercial use at https://github.com/HatimChergui.
comment: 7 pages, 4 figures
Robust Output Regulation of Uncertain Linear Time-Varying Systems
Robust output regulation for linear time-varying systems has remained an open problem for decades. By augmenting the classical immersion viewpoint, we propose the trajectory-matching system immersion framework. It reformulates the regulator equation as a forced system, and demonstrates that finding an internal model is equivalent to reproducing the non-decaying output trajectories of this forced system by constructing an unforced one. This perspective yields an exact algebraic boundary for finite-dimensional internal models, termed finite linear parameterization. It further reveals a distinctive obstruction in time-varying systems: even highly structured, finite-dimensional affine parametric uncertainties can generate infinite-dimensional families of non-decaying error-zeroing signals, thereby precluding exact robust regulation via linear finite-dimensional internal models in general. Hence, we develop a comprehensive approximate robust design, which yields a bounded tracking error that can be arbitrarily small, and avoids explicitly solving the regulator equation. Additionally, it recovers exact regulation when the uncertainty influences the system in some specified ways. Overall, these results clarify the intrinsic limitation of exact finite-dimensional robust regulation for uncertain LTV systems, and provide a general, executable framework for constructing an internal model-based design.
An approach to the LQG/LTR design problem with specifications for finite-dimensional SISO control systems
This is an expository paper which discusses an approach to the linear quadratic Gaussian/loop transfer recovery (LQG/LTR) design problem for finite-dimensional single-variable (single-input/single-output, SISO) control systems. The approach is based on the utilisation of weighting augmentation for incorporating design specifications into the framework of the LTR technique for LQG compensator design. The LQG compensator is to simultaneously meet given analytical low- and high-frequency design specifications expressed in terms of desirable sensitivity and controller noise sensitivity functions. The paper is aimed at non-specialists and, in particular, practitioners in finite-dimensional LQG theory interested in the design of feedback compensators for closed-loop performance and robustness shaping of SISO control systems in realistic situations. The proposed approach is illustrated by a detailed design example: the torque control of a geared DC motor with an elastically mounted output shaft.
comment: typos corrected; references added; additional computational details added
Utility-Aware DRL-Based TXOP Adaptation for NR-U and Wi-Fi Coexistence Networks
The coexistence of NR-U and Wi-Fi in the unlicensed spectrum introduces a challenging resource management problem, where heterogeneous channel access mechanisms can lead to unbalanced spectrum utilization and severe Wi-Fi performance degradation. To address this issue, this paper proposes a utility-aware deep reinforcement learning (DRL) framework for adaptive transmission opportunity (TXOP) control in NR-U/Wi-Fi coexistence networks. The coexistence process is formulated as a Markov decision process (MDP), in which the NR-U TXOP duration is treated as a controllable variable for regulating post-access channel occupancy. A deep Q-network (DQN) is then employed to learn adaptive TXOP control policies through online interaction with the coexistence environment. A key feature of the proposed framework is the integration of a configurable reward and criterion design, which enables explicit control of the fairness-efficiency-utility tradeoff. Three operating policies are developed, namely absolute fairness, moderate fairness, and utility-oriented moderate fairness, to characterize different coexistence operating points. Simulation results show that the proposed framework achieves a Jain fairness index above 0.9 under strict fairness control. Compared with the absolute fairness policy, the moderate fairness policy improves aggregate throughput by 68.22%, while the utility-oriented policy achieves a 177.6% improvement under the adopted utility evaluation metric. These results demonstrate that the proposed utility-aware DRL framework provides an effective and flexible solution for adaptive TXOP control and tradeoff management in heterogeneous unlicensed coexistence networks.
comment: 15 pages, 13 figures, 2 tables, submitted to IEEE Open Journal of the Communications Society
Complex Frequency as Generalized Eigenvalue
This paper shows that the concept of complex frequency, originally introduced to characterize the dynamics of signals with complex values, constitutes a generalization of eigenvalues when applied to the states of linear time-invariant (LTI) systems. Starting from the definition of geometric frequency, which provides a geometrical interpretation of frequency in electric circuits that admits a natural decomposition into symmetric and antisymmetric components associated with amplitude variation and rotational motion, respectively, we show that complex frequency arises as its restriction to the two-dimensional Euclidean plane. For LTI systems, it is shown that the complex frequencies computed from the system's states subject to a non-isometric transformation, coincide with the original system's eigenvalues. This equivalence is demonstrated for diagonalizable systems of any order. The paper provides a unified geometric interpretation of eigenvalues, bridging classical linear system theory with differential geometry of curves. The paper also highlights that this equivalence does not generally hold for nonlinear systems. On the other hand, the geometric frequency of the system can always be defined, providing a geometrical interpretation of the system flow. A variety of examples based on linear and nonlinear circuits illustrate the proposed framework.
A Geometric Analysis-Based Safety Assessment Framework for Marine Vehicle Route Decision-Making
This paper develops a Geometric Analysis-based Route Safety Assessment (GARSA) framework to enhance the safety of marine vehicles navigating in restricted waters. Utilizing line and point geometric elements to define waterway boundaries, the framework enables the construction of a dynamic width characterization function to quantify spatial safety within complex restricted navigation spaces. An iterative method is developed to calculate the dynamic width characterization function, enabling an abstracted spatial property representation of waterways. Based on this, a navigational safety index that accounts for the overall route safety as well as the spatial constraints of local segments is created, enabling the selection of the safest waterway to pass through. Case studies in Hamburg Port and hypothetical waterways show that GARSA identifies spatial safety differences among routes. For example, GARSA assigned safety assessment values of 0.6 and 135 to two intra-port transit routes of 7,258 m and 7,845 m, respectively, with the higher value indicating a safer route. Overall, GARSA provides a quantitative basis for safety-oriented route decision-making of marine vehicles in restricted waters.
comment: Accepted for publication in Reliability Engineering & System Safety (RESS)
Model-Reference Adaptive Flight Control of a 95-mg Insect-Scale Flapping-Wing Aerial Robot
Due to the system's scale and complex fabrication, the model describing the dynamics of a flapping-wing insect-scale aerial robot is subject to parameter uncertainty; for example, in the inertia matrix and the actuator mapping of the flier. Furthermore, due to its low inertia, this type of robot is greatly affected by stochastic and systematic disturbances during flight, including power-wire tension, gusts, and undesired aerodynamic forces produced by wing misalignment. Therefore, the high-performance execution of complex maneuvers at the subdecigram scale requires the robot to adapt its behavior to counteract disturbances and model uncertainty. Toward this objective, we introduce a model-reference adaptive control (MRAC) architecture for high-performance position control of flapping-wing robotic insects that can be modeled as rigid bodies in the three-dimensional (3D) space. In addition, we demonstrate how the implementation of a hybrid multiplicative extended Kálmán filter for estimating current and desired angular velocities during flight significantly dampens attitude vibrations, especially along the roll and pitch degrees of freedom (DOFs), and also improves flight performance. To show the suitability, functionality, and high performance of the proposed approach, we conducted real-time hovering and trajectory-tracking 6-DOF flight control experiments with a 95-mg insect-scale aerial robot.
comment: Under review, 8 pages, 7 figures
Dynamics and dose response in scaffold ligand binding
This paper considers systems in which two or more ligands bind independently to a common scaffold. Such systems arise in a range of applications, including immunotherapy and synthetic biology. We show that each stoichiometric compatibility class contains a unique steady state, and that this steady state is asymptotically stable. The main result gives a rigorous proof that the steady-state concentration of the fully bound complex, viewed as a function of the total scaffold concentration, has a unique maximum. This biphasic dose response is a characteristic feature of scaffolding systems and, in the special case of two ligands, plays an important role in the design and analysis of bispecific antibody drugs.
comment: Added much more motivation, and changed title and abstract to reflect that the general case (not just the case m=3) is now treated (with basically the same treatment)
Robotics
Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning
We present a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D task plans from calibrated multi-view RGB images. Rather than training an end-to-end policy, our system uses a vision-language model (VLM) to produce reference-frame task grounding and primitive-level 2D keypoints, then lifts them into 3D via multi-view fusion. This lifting combines triangulation of view-wise VLM groundings with reference-view ray voting, which searches along a semantic camera ray for geometrically consistent candidates across neighboring views. The resulting 3D keypoints support both pick-and-place and tool-use: for tool-use, we retrieve an object-centric atomic action corresponding to the inferred skill category and align its stored 6D tool trajectory to the scene; for dexterous execution, we expand the lifted grasp keypoint into a task-conditioned grasp affordance region and generate feasible grasp-motion pairs with an arm-hand motion generator. Real-world experiments show improved 3D grounding accuracy and execution reliability over single-view RGB-D grounding and fine-tuned VLA baselines. We further demonstrate long-horizon manipulation through closed-loop status verification and replan, enabling zero-shot execution on unseen objects and tool-use tasks in novel scenes.
Do as I Do: Dexterous Manipulation Data from Everyday Human Videos
How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.
comment: Project website: https://do-as-i-do.com/
UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning
Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
Modeling Branches for Active Manipulation using Iterative Parameter Estimation IROS 2026
This study presents a method for modeling diverse plant branches by iteratively estimating material parameters to support delicate branch manipulation. Branch manipulation is necessary in agricultural robotics for plant repositioning, stabilizing, and clearing visual obstructions in dense foliage. The proposed method builds a tetrahedral branch model from point-cloud data and simulates its behavior using the finite element method. Using real observed deformation data, it iteratively estimates branch parameters and then computes an optimal path with a deformation-aware motion planner to move and stabilize branches within another robot's field of view. Across 30 trials on branches with varying geometries and material properties, the proposed method reduced the deformation energy by 35.69% while increasing the path length by 8.10% on average.
comment: Accepted to IROS 2026
Observability and Consistency Analysis for Visual-Inertial Navigation with Anchored Feature Parameterizations IROS
This paper presents an analysis of the observability and consistency properties of filtering-based visual-inertial navigation systems (VINS) that utilize anchored feature representations. The unobservable subspace of VINS with anchored landmark parameterizations is shown to be independent of the estimated landmark state, which leads to improved estimator consistency properties without any additional modifications. However, the unobservable subspace is still found to depend on the estimated navigation state, necessitating additional consistency-enforcing techniques. Two methods to improve the consistency of VINS with anchored feature representations are presented. Simulation results showcase that all estimators employing anchored feature paramterizations exhibit improved consistency properties compared to algorithms that estimate features resolved in a global reference frame, especially in scenarios where feature initialization may be poor. Real-world experiments on the TUM-VI dataset showcase that the use of anchored feature representations alone can yield comparable performance to consistency-improved estimators employing a global feature representation, demonstrating the benefit of using anchored feature parameterizations for VINS.
comment: Accepted to IEEE/RSJ IROS. 8 pages, 3 figures, 4 tables
Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.
comment: Project page: https://tttonyalpha.github.io/act2answer/
A Mixed-Reality Testbed for Autonomous Vehicles
We propose a mixed-reality, hardware-in-the-loop (HIL) testbed for autonomous vehicles that seamlessly integrates a physical testbed of mobile robots with a high-fidelity simulation environment. The virtual simulation enables the creation of diverse, safety-critical driving scenarios to validate state-of-the-art perception, planning, and control algorithms, while augmenting simulations with physical robots equipped with multimodal sensors in photorealistic virtual environments further facilitating rigorous validation. Our testbed also features vehicular connectivity using wireless communication and can accommodate a large number of agents through the combination of physical robots and virtual simulated agents, supporting research on multi-agent systems including Connected and Autonomous Vehicles (CAVs). Finally, we present a safety-guaranteed framework combining perception, planning and a novel online learning-based controller using Control Barrier Functions (CBFs) for CAVs. Experiments using the proposed framework are used to validate and demonstrate the key functionalities and the overall utility of the testbed to bridge the gap between simulation and real-world hardware deployment.
comment: 9 pages, 7 figures, 1 table
Shape Sensing of Continuum Robots using Direct Laser Writing
Continuum robots offer a promising approach for minimally invasive and natural-orifice surgical procedures due to their inherent compliance and dexterity. However, this flexibility also makes estimating the current shape of the robot challenging. Several approaches have been used to reconstruct the shape of these robots, including imaging, optical sensing, magnetic sensing, and resistive sensing. Strain sensors fabricated using direct laser writing (DLW) could provide an alternative sensing method. This technique involves using a laser to induce carbonization of certain polymers to create graphene patterns, such as strain sensors. In this paper, we demonstrate how a flexible continuum joint and a DLW sensor can be machined as one monolithic structure using the same laser and the same setup. The fabricated sensors are characterized using linear and nonlinear models, which are used to predict the joint angle with error as low as 1.76 degrees. Furthermore, we demonstrate how a DLW sensor can be used to implement closed-loop control in a robotic joint, achieving tracking error under 3 degrees.
comment: This work has been submitted to the IEEE for possible publication
CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems
Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving $73$--$87\%$ ROI pixel-coverage reduction with $5$--$8\times$ estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.
OneCanvas: 3D Scene Understanding via Panoramic Reprojection
Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
comment: Project page: https://baranowskibrt.github.io/onecanvas/
Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation
Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
Mobile Pedipulation for Object Sliding via Hierarchical Control on a Wheeled Bipedal Robot
In this letter, we present a hierarchical control framework that enables wheeled bipedal robots to perform planar object sliding tasks with their wheeled legs. The proposed approach formulates a nonlinear model predictive controller (NMPC) based on a reduced-order three rigid bodies (TRB) dynamical model that explicitly accounts for the hip roll degree of freedom and multiple wheel-environment contact modes, which is essential for lateral stepping and pedipulation tasks. Within this framework, the NMPC simultaneously regulates robot locomotion and interaction forces, allowing the robot to stably execute both rolling and object manipulation behaviors. A trajectory-optimization-based robot-object motion planner is developed to generate reference motions that incorporate stick-slip transitions in ground-object contact. Two representative pedipulation motions, namely scooting and lateral sliding, are validated through real-world hardware experiments, in which the robot successfully retrieves a 1 kg object from under a desk and slides a 4 kg object over a distance of 0.228 m via scooting.
comment: 8 pages, 7 figures
Constant Time-Delay Leader Following with Neural Networks and Invariant Extended Kalman Filters for Arbitrary Trajectories
This paper proposes a constant time-delay trajectory tracking method for vehicle convoys operating without inter-vehicle communication, a common coordinate system, or global positioning. The method integrates a probabilistic sequence-to-sequence (Seq2Seq) neural network with an invariant extended Kalman filter (IEKF) to warm-start the prediction process, allowing accurate estimation of a leader vehicle's relative trajectory on the SE(2) manifold. A geometric model predictive controller is further incorporated to fully exploit the manifold-based trajectory predictions for improved control performance. The system can handle arbitrary nonlinear trajectories with varying speeds and motion profiles while reducing the need for expert-based domain knowledge for the design of trajectory following systems, even under long trajectory delays. The effectiveness of the method is validated through comparisons with a pure IEKF baseline, learning-based methods, and the ground-truth trajectory in kinematic simulations, as well as in experiments using real robotic vehicles.
comment: 9 pages, 6 figures
Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation
This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.
FAST-LIVGO: A Degeneracy-Robust LiDAR-Inertial-Visual-GNSS Fusion Odometry IROS 2026
Robust state estimation and mapping in long-term, large-scale, and highly dynamic environments remains a key challenge in robotics. Existing LiDAR-Inertial-Visual Odometry (LIVO) systems achieve strong local accuracy but suffer from accumulated drift over long distances and may fail in geometrically degraded or textureless scenes. Meanwhile, GNSS-aided fusion frameworks often rely on LiDAR or visual odometry for state prediction and outlier rejection, making them vulnerable when odometry degenerates. To address these limitations, we propose a tightly coupled LiDAR-Inertial-Visual-GNSS fusion framework based on an Error-State Iterated Kalman Filter. An online spatiotemporal alignment module using Dynamic Time Warping is introduced for highly dynamic conditions. To better exploit GNSS precision, we develop observation models based on Doppler shifts and fixed-anchor Time-Differenced Carrier Phase, providing millimeter-level relative constraints without augmenting historical anchor states. We further design a degeneracy-aware dual-mode outlier rejection strategy that switches between LIVO-prior-guided rejection and GNSS-aided recovery according to the LIVO degeneracy level. Experiments on the public M3DGR dataset and a custom 20~m/s fixed-wing UAV dataset demonstrate that our system reduces accumulated drift and map ghosting, outperforming state-of-the-art methods in accuracy and robustness.
comment: Accepted for presentation at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise ICRA
Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority samples comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: (1) Extreme class imbalance where delayed/false triggers are overwhelmed by true triggers; (2) Asymmetric label noise where mislabeled majority samples (true triggers) suppress minority samples (delayed/false triggers) learning. To overcome these challenges, we propose two key innovations: (1) Specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; (2) noise suppression using stable hardness estimation and probe-guided adaptive threshold to clean mislabeled true trigger samples. Crucially, we deploy our model as a practical annotation system with full-stack architecture, efficiently identifying critical delayed/false triggers from thousands of daily AEB events. Production results demonstrate 80% improvement in recall of delayed/false triggers and 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization
comment: 8 pages, 5 figures, accepted by IEEE International Conference on Robotics and Automation (ICRA)
Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight
Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.
comment: 6 pages 9 figues
HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision
Establishing a universal benchmark for tactile representation learning in robotic manipulation remains challenging due to the diversity of tactile sensor designs, data formats, and robot embodiments. Rather than seeking to establish such, we explore a scalable and promising direction for future development: egocentric vision paired with full-hand tactile data. To this end, we introduce \textbf{HT-Bench}, a large-scale multi-task benchmark for dexterous full-hand tactile sensing, comprising 10M RGB frames and 7.8M tactile frames collected across 226 tasks. HT-Bench evaluates tactile representations from three key perspectives: whether they encode meaningful contact geometry, whether they can align tactile observations with visual information, and whether they generalize to unseen tasks. To assess these capabilities, HT-Bench includes four tasks: fine-grained tactile similarity retrieval, masked tactile inpainting, vision-to-tactile synthesis, and multimodal tactile frame prediction. We further propose \textbf{HandTouch}, a vector-quantized vision--tactile encoder that learns tactile representations through progressive spatial, cross-modal, and temporal training. Across HT-Bench, HandTouch consistently outperforms representative tactile encoder baselines, improving Recall@5 on fine-grained tactile similarity retrieval from 74.65\% to 85.23\%, reducing RMSE on masked tactile inpainting from 0.022 to 0.010, and increasing OOD cIoU on vision-to-tactile synthesis from 0.628 to 0.705. These results demonstrate the effectiveness of HandTouch and suggest that large-scale egocentric full-hand tactile data provides a scalable basis for evaluating and advancing tactile representation learning in dexterous manipulation.
comment: 9pages, 4figures
Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes
Autonomous robots operating under forest canopies need robust perception of trees and surrounding vegetation across varying seasonal conditions. Existing forestry datasets provide lidar or camera data with per-tree annotations, but none include co-registered 4D imaging radar -- a modality of growing interest for its resilience to visual degradation, surface contamination, and vegetation occlusion. We introduce a multi-sensor forest dataset collected by a mobile robot equipped with a high-resolution FMCW imaging radar, lidar, RGB camera, IMU, and RTK-GNSS. The site was recorded in two sessions under contrasting vegetation states, and 3D cuboid annotations -- including per-tree diameter estimates -- provide shared semantic labels across all three perception modalities. Furthermore, we provide baseline results for semantic segmentation of the radar and lidar point clouds using MinkowskiUNet. Radar achieves IoU scores competitive with lidar for dominant classes (ground 91%, canopy 86%) while lagging on geometrically fine structures such as tree trunks (56% vs. 74%). A cross-modality analysis further compares lidar and radar trunk segmentation against an RGB detection model, and a diameter-stratified evaluation reveals how trunk segmentation quality varies with tree size. Beyond segmentation, the co-registered multi-modal data and RTK-GNSS-aided reference positioning support research in mapping, localization, and sensor fusion under canopy. The dataset and annotation tools are publicly available.
comment: 33 pages, 11 figures
Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning
Sidewalks in the real world are crowded, cluttered, and less structured than roads, making 3D occupancy prediction a key ingredient for the safe navigation of mobile robots such as delivery bots and electric wheelchairs. Existing occupancy learning pipelines are largely designed for on-road autonomous driving and often train on large-scale paired LiDAR-RGB datasets with dense 3D supervision and multiple camera inputs, which are costly to collect and do not adequately capture sidewalk-specific characteristics. We propose WalkOCC, a hybrid Ray-marching monocular 3D occupancy perception framework for robots operating on sidewalks. WalkOCC explicitly couples geometric grounding from LiDAR-RGB paired data with scalable learning from large-scale unpaired monocular images. It bootstraps pseudo occupancy supervision from paired sequences and jointly learns image-level representations on additional 2D-only data. It yields stable optimization and improved generalization without requiring costly 3D occupancy annotations. Extensive experiments demonstrate consistent gains in prediction accuracy, fine-grained segmentation of subtle urban structures such as curbs and gutters, and robustness to environmental and cross-embodiment shifts compared with self-supervised image-based baselines. To facilitate evaluation and benchmarking, we also introduce Sidewalk3D, a large-scale sidewalk perception dataset with LiDAR-camera paired sequences collected across multiple locations and time periods, along with 3D semantic occupancy annotations for evaluation. Code and data will be made available.
GCNGrasp-VP: Affordance-Guided View Planning for Efficient Task-Oriented Grasping IROS 2026
Task-oriented grasping performance degrades significantly when object views suffer from occlusions. Existing task-oriented grasping methods typically assume task-relevant regions are visible in the initial frame, while view planning approaches enable active perception but often ignore task semantics and rely on time-consuming scene reconstruction. To address these limitations, we present GCNGrasp-VP, an efficient framework integrating affordance field prediction with active view planning. Central to this framework is GCNGrasp-v2, a task-oriented grasp model that simultaneously supports grasp evaluation and affordance field prediction, achieving constant-time inference complexity. Leveraging this capability, our Affordance-guided View Planner (Affordance-VP) utilizes the affordance field as an information gain metric to guide camera observation of task-relevant regions without requiring scene reconstruction. View planning results show that our method significantly outperforms scene-uncertainty-driven baselines with only one view adjustment. Real-world validation further confirms substantial improvements in grasp success rates for single-object scenarios while maintaining millisecond-level computational latency. Code and models are available at https://github.com/Instinct323/GCNGrasp-VP.
comment: Accepted to IROS 2026
ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks
Vision-Language Models (VLMs) enable robots to follow open-language instructions. However, dense VLM embeddings have shown to be noisy and lack spatial consistency. This is problematic for robotic applications, which require simultaneous reasoning over semantics and 3D space. We examine spatial structure across recent VLMs and propose ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediates to improve dense language-grounded retrieval. ReSiReg clusters intermediates into visual prototypes, derives their language descriptors, and reconstructs each patch as a soft mixture of prototype-level language embeddings. We evaluate quantitatively on OVSS and 3D mapping across backbones, and qualitatively in real-world manipulation scenes. Quantitative results show improved dense retrieval; manipulation scenes show more spatially consistent target activations. We further provide a compact 25M dense VLM for robotic applications, substantially smaller than and competitive with ViT-B baselines. Available at https://resireg.github.io
ART-VS: Adaptive Resolution Tiling for Vision Transformer Visual Servoing IROS2026
Visual servoing with self-supervised Vision Transformer (ViT) features enables training-free robotic positioning with strong generalization, but faces a fundamental trade-off between robustness and precision. Coarse patch-level descriptors provide stable correspondences yet limit positioning accuracy. Increasing image resolution improves precision but yields only marginal robustness gains - under perturbation, high-resolution processing improves convergence success rate from 76.6% to just 81.0% despite 12x more ViT patches. Therefore, we propose Adaptive Resolution Tiling Visual Servoing (ART-VS), a two-phase method that adapts feature granularity to servoing progress: a coarse phase at native ViT resolution for stable alignment, then a tiled high-resolution phase that restricts matching to local neighborhoods improving positioning accuracy. Without any task-specific training, ART-VS achieves 95.4% convergence under perturbation, outperforming standard and full-resolution ViT-based servoing by 18.8 and 14.4 percentage points. Over the former it reduces positioning error by 53%, while running at over 10x higher speed and 27% lower VRAM than the latter. We validate ART-VS across three ViT backbones and demonstrate real-world category-level grasping of unseen object instances, achieving 95/100 on transparent bottles and 98/100 on shoes. Code available under https://art-vs.github.io/.
comment: Accepted at IROS2026
Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots
Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.
Congestion-Aware Robot Tour Planning in Crowded Environments IROS 2026
Autonomous mobile service robots are often required to complete tours that require navigating through a set of locations in an environment. Example domains include guiding people through a shopping mall, delivering packages in a fulfilment centre, or giving guided tours in a museum. However, in crowded environments, the presence of people may negatively impact robot performance. For example, humans will activate robot collision avoidance manoeuvres that slow the robot down. Crowds move stochastically and vary throughout the day. In this paper we present a probabilistic tour planner for crowded environments which explicitly reasons over human congestion. We learn circular linear flow field (CLiFF) maps which predict human trajectories given an initial observation. We then use these predictions to build and solve a Markov decision process online which efficiently routes the robot through the environment. Our approach is scalable enough to re-plan as new people are observed. We evaluate our approach on a real-world crowd dataset in a shopping mall.
comment: Accepted to IEEE IROS 2026
Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation
Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5\%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58\% to 72\% on long-horizon tasks.
TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer IROS 2026
Tactile sensing provides direct measurements of contact interactions that are essential for robotic manipulation. However, current simulators lack the fidelity to faithfully model the complex deformation and transduction mechanics of tactile sensors, severely hindering sim-to-real transfer in robot learning pipelines. To address this challenge, we propose a multi-modal representation learning framework that aligns heterogeneous tactile modalities within a shared latent space, eliminating the need for accurate raw-signal simulation while preserving relevant contact information. Our approach employs modality-specific encoders to project diverse tactile observations, such as simulated penetration depth and real-world capacitance, into a common embedding space. The model is trained using self- and cross-reconstruction objectives alongside contrastive alignment, encouraging modality-invariant yet information-rich representations. We evaluate the learned embeddings on indenter shape identification, force prediction, and geometric reconstruction tasks, training exclusively in simulation and testing directly on real sensor measurements. Our results demonstrate zero-shot sim-to-real transfer across physically dissimilar representations. Furthermore, incorporating multi-physics simulation modalities yields more informative embeddings that transfer across diverse downstream tasks, demonstrating a 16.7% reduction in force prediction error and a 45.8% reduction in shape reconstruction error. Finally, we release an efficient Warp-based implementation of a penalty-based tactile simulation model for Isaac Lab, enabling scalable tactile data generation.
comment: 9 pages, 6 figures, 4 tables, accepted into IROS 2026
Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos IROS 2026
Training generalist Vision-Language-Action(VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.
comment: Accepted to IROS 2026
Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement
Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors; Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/
comment: 8 pages, 7 figures, 2 tables; 8-page appendix
A High-accuracy Event-based Underwater SLAM System
While event cameras offer immense potential for underwater SLAM, existing Time Surface (TS)-based methods prove highly unreliable when deployed underwater. Fluctuating camera velocities severely degrade TS imaging quality, while wide stereo baselines and repetitive underwater textures induce critical matching failures, frequently triggering system failure. To overcome these challenges, we develop the first high-accuracy event-based underwater stereo SLAM system. A structure-aware metric for TS is designed based on structure tensor coherence and gradients to quantitatively evaluate TS structural information density. By decoupling the optimal TS generation into two distinct stages based on system initialization, Bayesian Optimization(BO) first predicts an optimal prior TS sequentially before initialization while we set an asynchronous online local searching method periodically to obtain appropriate TS in real-time during the tracking stage. We use the prior disparity to guarantee precise data association and "latest-observation-first'' triangulation mechanism to realize stable triangulation. As a benchmark for these solutions and a resource for the community, we also contribute UWE, the first high-quality real-world underwater event dataset containing variable camera motions, complex textures and different trajectory features. Extensive evaluations on public datasets and UWE show the competitive accuracy performance of the proposed SLAM system compared to the state-of-the-art event-based method. The code and data will be open-sourced.
C-ARC: Continuous-Adaptive Range Clustering for Non-Repetitive LiDAR Sensors
Real-time LiDAR clustering identifies structures in point clouds, which is an essential prerequisite for many mobile robotics algorithms. Current methods are mostly developed for repetitive mechanical LiDAR sensors. Recently, the use of non-repetitive LiDAR sensors is strongly increasing due to their small cost and form factor. Such non-repetitive Risley prism-based sensors violate two key assumptions of repetitive mechanical sensors: structured scan lines and well-defined frame boundaries. Their Rhodonea-curve trajectories produce non-uniform point distributions, and the absence of a rotation cycle renders conventional scan line indexing inapplicable. To meet such new requirements, we developed C-ARC, a Continuous-Adaptive Range Clustering framework that maintains a persistent dual-graph over a sliding window, decoupling high-frequency point insertion from on-demand cluster retrieval. This is crucial for key functionalities like SLAM or tracking. An adaptive range grid resolution mechanism calibrates grid dimensions at initialization using an exponential control loop, balancing the sparsity-collision trade-off without prior knowledge of the scanning pattern. Implemented as an open-sourced single-threaded C++17 library, C-ARC produces real-time cluster output at 20 Hz on commodity hardware for the Livox Mid-360. Evaluation on the Livox Avia identifies unbounded cell occupancy as the primary limitation for sensors with strongly concentrated scan patterns. The adaptive resolution mechanism additionally improves clustering quality for existing grid-based methods on non-repetitive data.
comment: Submitted to IEEE Robotics and Automation Letters. This work has been submitted to the IEEE for possible publication. 8 pages, 7 figures
ZiMPedance: Impedance-Aware ZMP Modeling and Control for Payload Carrying with Quadruped Robots
Load transportation with quadruped robots is strongly affected by the dynamics of the physical interface between the robot and the load. Passive spring-based arms reduce weight and complexity compared to active manipulators, but their spring-damper dynamics can introduce oscillatory forces that degrade locomotion stability. This paper derives an extended Zero Moment Point (ZMP) formulation that includes passive payload-interface dynamics, relating stiffness, damping, and payload mass to the stability margin. The analysis shows that underdamped configurations can resonate with locomotion harmonics. Based on this insight, we augment a Single Rigid Body Dynamics model with passive subsystem dynamics and integrate it into a Model Predictive Control framework. In simulation, the proposed controller reduces stability violations by up to $10\times$, from $7.0\%$ to $0.7\%$, and increase locomotion efficiency by lowering horizontal ground reaction force effort by up to $15\%$ compared to a nominal baseline. Hardware experiments with a $2\,\mathrm{kg}$ payload show stable locomotion under pull-release disturbances where the nominal controller fails. The same model also enables end-effector tracking through passive arm dynamics without direct arm actuation.
Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation
Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups -- frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.
HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations
Human demonstrations, which can be collected at scale and naturally capture active hand-eye coordination, are a promising data source for learning humanoid loco-manipulation. However, directly transferring human demonstrations to humanoids requires a precise world-frame tracking controller, which is often brittle under Out-of-Distribution(OOD) targets, while human-to-humanoid gaps persist in both egocentric observation and action execution. To address these challenges, we present HALOMI, a scalable framework for learning humanoid loco-manipulation with active perception from human demonstrations. HALOMI extends Universal Manipulation Interface (UMI) with egocentric sensing to collect ego-view and wrist-view observations along with head-hand trajectories at scale. We further propose a manifold-constrained controller that plans in a learned latent behavior manifold to enable precise and robust head-hand tracking in the world frame. To bridge the human-to-humanoid gap, we perform ego-view alignment and introduce a controller-aware reference trajectory adaptation to reduce mismatch in both observation and action execution. We validate HALOMI on a Unitree G1 humanoid robot with an actuated neck across five real-world tasks involving navigation, grasping, bimanual manipulation, whole-body coordination, and dynamic behaviors. Across the three quantitatively evaluated tasks, HALOMI achieves an average success rate of 85\%, while additional qualitative demonstrations show its ability to support dynamic tossing and deep-squat grasping.
Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs
Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.
comment: 8 Pages, 6 Figures
Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles
The Moving-Target Traveling Salesman Problem (MT-TSP) seeks a minimum cost trajectory for an agent that departs from a static depot, visits a set of moving targets, each within one of their assigned time windows, and returns to the depot. In this article, we study the Moving-Target Traveling Salesman Problem with Moving Obstacles (MT-TSP-MO), a generalization of the MT-TSP where the agent trajectory must avoid moving obstacles. We present a Mixed-Integer Conic Programming (MICP) formulation that can be solved using off-the-shelf solvers, as well as a fast and scalable Two-Phase Bilevel Search (TPBS) algorithm that computes high-quality feasible solutions for the problem. We evaluate our approaches against an existing baseline algorithm on a broad range of problem instances with up to 40 targets and 40 obstacles. The results demonstrate that both the proposed methods significantly outperform the baseline with respect to success rates, solution costs, and computation time.
Selective Unit-Cell Actuation in Lattice Structures for Distributed Morphology in Soft Robots IROS 2026
Soft lattice structures are increasingly used in robotics to tailor compliance and guide deformation; however, actuation is typically introduced at the device or module level, with actuators inserted into otherwise passive architectures. In this work, we move actuator-lattice co-design to the unit-cell scale. We present an embedded pneumatic unit cell that integrates curved-strut lattice geometry with a bidirectional bellow actuator within a single monolithic element. When tessellated, the lattice functions as a distributed actuation field in which global morphology is governed by spatial actuation patterns rather than uniform pressurization. Experimental characterization of 1x1, 2x2, and 3x3 tessellations demonstrates scalable displacement and force generation with repeatable cyclic performance. Selective actuation of unit cells in a 3x3x3 array produces distinct global deformation modes, including bending and directional grasping, without altering hardware configuration. Additionally, coupling active and passive unit cells enables bending-driven crawling locomotion, demonstrating that heterogeneous tessellations can translate through asymmetric deformation. These results establish unit-cell-level actuation as a strategy for distributed morphing in lattice-based soft robots and provide a foundation for scalable, monolithic robotic architectures.
comment: Accepted to IROS 2026, 8 pages, 5 figures
Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets
The energy-based method remains a comparatively underexamined approach for surface classification in mobile robotics, despite promising results in constrained environments. This study evaluated the viability of using energy-derived features as either a standalone classification modality or as supplementary input to inertial data. A comprehensive evaluation was conducted across three publicly available datasets, comparing the performance of modern deep learning architectures including recurrent neural networks, convolutional neural networks, encoder-only transformers, and Mamba state-space models, under automated hyperparameter tuning and input sequence length optimization. The models achieved higher accuracy than previously reported values on all evaluated datasets, with the convolutional neural network yielding the highest overall performance. When relying exclusively on energy-based features, the models attained classification accuracies in the range of 85-90%, approximately 5-10% lower than those achieved when combined with inertial features (96-99%). Augmenting inertial data with energy features resulted in a consistent mean accuracy improvement of 1-2%. These findings indicate that classifiers relying solely on energy features offer sufficient accuracy for standalone deployment, while also providing a consistent gain when used in combination with other sensing modalities.
Stealthy World Model Manipulation via Data Poisoning NeurIPS 2026
Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.
comment: 41 pages, 8 figures, 11 tables. Submitted to NeurIPS 2026
Spatially Stratified Distillation for Heterogeneous Radar Place Recognition ICRA
Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field-of-view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying different radar signals. That is, projecting both signals into a common representational space. Yet, they suffer performance degradation in multi-session environments. In this paper, we propose spatially-stratified distillation (SSD); a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations of the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.
comment: IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026
High-Degree-of-Freedom Lightweight Bioinspired Leg for Enhanced Mobility in Small Robots
In microrobotics, enhancing locomotion capabilities by increasing the degrees of freedom (DoF) of leg mechanisms under severe spatial constraints remains a significant challenge. Inspired by insect locomotion, this paper presents a novel micro-scale parallel leg mechanism with four degrees of freedom, and systematically analyzes its mechanical design, electrical system, and kinematics. The design incorporates two spherical five-bar linkages to achieve spatial motion within a parallel four-bar configuration. Furthermore, a concentric design strategy is employed to simplify the analytical solution of the leg kinematics. Due to the parallel system architecture, all actuators are located on the main body, substantially reducing the equivalent inertia of moving parts compared to traditional high-DOF leg structures. The total mass of the system is only 18.9 g, with an end-effector output force of approximately 0.5 N and a workspace exceeding 22255 mm3. Experimental results demonstrate that the proposed single-leg mechanism achieves excellent motion flexibility, highlighting its potential for micro bio-inspired robotics.
A Scalable Embodied Intelligence Platform for Seamless Real-to-Sim-to-Real Transfer of Household Mobile Manipulation Tasks
Mobile manipulation is a fundamental capability in embodied intelligence robotics. The growing demand for robust and generalizable manipulation in unstructured household environments has driven rapid progress in embodied intelligence platforms. However, achieving a seamless transfer across the real-to-sim-to-real cycle faces three key challenges, including costly high-fidelity simulation scenes reconstruction, the complexity of systematic strategy evaluation in simulation, and incompatible real-world deployments. To address these challenges, we develop BestMan, a scalable and seamless real-to-sim-to-real platform that bridges the gap between the simulation and the real world, enabling effective strategy development, integration, and deployment for household mobile manipulation. Specifically, we design a novel Automated Scene Generation (ASG) module to reconstruct realistic simulations from real observations. Then, we propose a simulation-guided task formalization and skill learning architecture that supports the flexible integration and large-scale evaluations of hybrid skill strategies in simulation. Finally, to enhance the real-world scalability, we develop a Hardware-agnostic and Unified Middleware (HUM) to ensure seamless and compatible sim-to-real transfer across heterogeneous mobile manipulators for real deployments. Experimental results demonstrate the superior performance of our proposed platform in establishing standardized benchmarks and facilitating promising research in the field of mobile manipulation.
comment: CCF Transactions on Pervasive Computing and Interaction
EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation
To locate a target object while exploring the unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of such task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works aim to address this core challenge and achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on massive simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. Across two standard metrics--Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Recognizing the different emphases of the two datasets, the performances reveals this framework is more balanced and generalizable for efficient ObjNav.
ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models
Embodied Foundation Models (EFMs) integrate multimodal understanding, future-state reasoning, and executable robot actions. Yet their safety alignment for human-injury prevention remains underexplored, primarily because real-world data of robots harming humans or creating hazardous household situations cannot be safely or ethically collected. To address this challenge, we propose a safety-critical data construction pipeline for human-injury prevention in EFMs.Starting from real DROID observations, our construction pipeline proceeds through scene understanding, hazard-aware image editing, temporal prompt generation, and single-pass rollout synthesis. The temporal prompts specify the expected scene evolution, while Wan2.7 synthesizes realistic robotic rollouts from the edited hazardous states in a single pass. Using this pipeline, we construct ROBOSHACKLES, a 10,000-clip robotic video dataset derived from real DROID observations, spanning two direct-harm and four indirect-harm categories. To ensure dataset quality, we assess task completion and visual quality with automatic metrics, and evaluate six representative EFMs under a refusal-based safety criterion. Results show that all evaluated models produce unsafe actions in the tested safety-critical scenarios, yielding a 100% unsafe action generation rate. ROBOSHACKLES serves as a scalable benchmark and training resource for refusal learning and hazard anticipation before robot action execution.The dataset is publicly available at https://huggingface.co/datasets/YZW00/RoboShackles.
DNN Koopman-Based Deviation Compensation for UGV Path Tracking Control on Coupled Slope and Potholed Road
Unmanned ground vehicles (UGVs) operating in off-road scenarios are confronted with complex terrain disturbances that can substantially degrade path tracking performance. To address this challenge, this paper proposes a deep neural network (DNN) Koopman-based deviation compensation strategy for UGV path tracking control. Firstly, based on the vehicle dynamic function on coupled slope, an adaptive forgetting recursive least squares method with decoupled error terms is designed to estimate tire cornering stiffness. On this basis, a Laguerre model predictive control (LMPC) path tracking control strategy is designed by incorporating Laguerre functions, which can reduce computational resource usage while maintaining reliable tracking performance across different coupled slope scenarios. Then, by integrating Koopman operator theory with DNN, a DNN Koopman (DK) path deviation compensation method is proposed, which significantly improves the path tracking accuracy of UGV under potholed road disturbances. Furthermore, an event-triggered parallel cooperative (EPC) compensation mechanism that couples LMPC with DK is established based on compensation activation criteria and credibility verification. This mechanism improves path tracking accuracy on potholed road while ensuring the feasibility of overall steering command and stability of vehicle after DK compensation. Finally, a hardware-in-the-loop (HiL) experimental platform is constructed for validation. Experimental results demonstrate that the proposed UGV path tracking strategy improves tracking performance by more than 11.5% across multiple operating conditions.
comment: 22 pages, 13 figures
Self-Supervised Mask-Aware Transformers for Fault-Tolerant FBG Force Sensing in Minimally Invasive Surgical Robotics
In minimally invasive surgical robotics, catheter-scale Fiber Bragg Grating (FBG) sensors are promising due to their ability to estimate multi-dimensional forces by multiplexing several optical channels. However, deploying these compact multi-channel sensors introduces two critical engineering challenges: inherent nonlinear cross-axis coupling during complex deformations, and intermittent channel dropouts caused by fiber fractures in constrained workspaces. These compounding issues severely degrade force estimation. Existing fault-tolerant approaches rely on combinatorial model banks, which scale exponentially with the channel count and demand prohibitively expensive per-pattern calibration. In this paper, we propose a unified, self-supervised mask-aware Transformer that explicitly models channel availability to enable graceful degradation under diverse and dynamic sensor failures. The encoder is pretrained via masked-channel reconstruction on unlabeled data streams and fine-tuned for force regression using a balanced clean-and-corrupted-view objective alongside a dynamic corruption curriculum. Furthermore, a parallel uncertainty head, trained via heteroscedastic Gaussian negative log-likelihood, predicts per-axis confidence in a single forward pass, circumventing the overhead of multi-pass ensembles. Evaluated on a catheter-scale 8-channel FBG dataset, our single unified model achieves a nominal Root Mean Square Error (RMSE) of 0.0066~N and degrades gracefully to 0.0126~N under severe 4-channel failures. This significantly outperforms a comprehensive model bank of 255 per-pattern neural networks (0.0154~N at 4-channel loss) while eliminating pattern-specific calibration.
SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping
Robotic jumping is pivotal in applications such as search and rescue and logistics, where crossing obstacles and enhancing mobility efficiency are critical. The Spring-Loaded Inverted Pendulum (SLIP) model leverages simplified spring-mass dynamics that naturally encode biologically plausible hopping motions, yet its performance degrades on irregular terrain due to idealized assumptions regarding contact and joint dynamics. Meanwhile, Reinforcement Learning (RL) can adapt to diverse and complex environments but often requires extensive data from unguided exploration. The complementary strengths of SLIP's physically grounded baseline and RL's adaptive capabilities motivate a hybrid framework that overcomes these individual limitations. We therefore propose Spring-loaded Reinforcement Learning (SRL), which integrates SLIP-based feedforward control signals with RL-driven real-time feedback, enabling continuous optimization of robotic jumping. Experimental results demonstrate that SRL can achieve more stable jumps with much less training time than the baseline method, maintaining an average position tracking error below 0.1 m and velocity tracking errors within +/-3% of the target values. Through bipedal and quadrupedal simulations of ground and stair jumping, as well as sim-to-sim and sim-to-real validations, SRL exhibits robust adaptability to various task requirements and environmental complexities, underscoring its potential for real-world deployment.
comment: 17 pages, 12 figures
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation
Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
Admittance-Based Surface Alignment for Human-in-the-Loop Robotic Visual Inspection
Precision visual inspection underpins quality assurance across aerospace, semiconductor, and medical manufacturing, where undetected surface anomalies on high-value parts translate directly into scrap, rework, and field failures. Robotic visual inspection requires precise alignment between the end-effector and local surface geometry in the presence of perception noise and surface irregularities. In industrial settings, a human operator is often kept in the loop via teleoperation or shared autonomy, introducing real-time adjustments that render purely offline motion planning inadequate. This motivates control architectures capable of reactive, compliant behavior under combined human and perceptual uncertainty. This paper presents a novel real-time, closed-loop robotic orientation control pipeline for precision visual inspection, with an admittance-based framework that unifies operator input and perception-driven surface alignment. We design the end-effector as a virtual sphere moving through a viscous medium, such that the resulting physically interpretable mass--damper system generates synchronized, compliant motion from orientation error and operator commands. We validate the framework on a 6-DOF manipulator demonstrating stable normal-tracking and a final mean orientation error of 0.4°.
Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation
In real-world reinforcement learning (RL), the choice of action space can play a key role in shaping motion smoothness, safety, and overall task performance. In this study, we evaluate pose increment, pose velocity, joint position increment, and joint velocity across two vision-based manipulation tasks: object picking and pushing. We train policies in simulation and deploy them to the real world using sim-to-real transfer. We find that action-space representation indeed significantly affects sim-to-real performance. In particular, we find that the joint velocity action space is best for the vision-based picking and pushing tasks in terms of smoothness and final task performance. We also provide practical guidance for RL practitioners in choosing action spaces for both simulation and real-world experiments.
comment: 9 pages with references
DREAM-Chunk: Reactive Action Chunking with Latent World Model
Action chunking has become a common interface for vision-language-action (VLA) models, enabling low-frequency policy inference to drive high-frequency robot execution. However, once an action chunk is committed, its open-loop execution can be brittle under stochastic dynamics, hardware execution errors, and partial observability. We propose DREAM-Chunk, a test-time scaling method that augments chunking-based policies with a lightweight latent world model, without requiring additional policy fine-tuning. At test time, DREAM-Chunk samples multiple candidate action chunks, rolls out their predicted latent futures, and selects actions from the chunk whose predicted state best matches the observed rollout. In this way, DREAM-Chunk uses additional test-time computation to cover multiple plausible stochastic futures and improve reactivity during long-horizon chunk execution. On the Kinetix benchmark, DREAM-Chunk improves robustness under increasing action noise and benefits from larger candidate sample sizes, especially when demonstrations contain corrective behaviors. We further validate DREAM-Chunk on four manipulation tasks across two robot platforms and two VLA policies under various sources of stochasticity. Across simulation and hardware experiments, DREAM-Chunk improves the robustness of action-chunking policies in stochastic dynamics.
Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking
LiDAR place recognition determines one's position on a prior point cloud map. The most studied ground-level LiDAR place recognition suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds, and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the priors that neighboring point cloud patches share similar semantics with anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm to exploit neighborhood information maximally and refine each feature based on neighbor features, which are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8\% improvement in average Recall@1 and a 3.2\% improvement in average Recall@1\% on the CS-Urban-Scenes, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9\% on CS-Campus3D and 10.2\% on CS-Urban-Scenes without additional training.
Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics
The GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off-road imagery over a fine-grained taxonomy of 64 classes and 11 evaluated non-void coarse categories. We present the first-place solution to this challenge. Our solution comprises two complementary improvements: (a) a network-level design that combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, together with a coarse-category auxiliary loss on the global [CLS] token; and (b) an inference-time aggregation strategy based on multi-scale and horizontal-flip test-time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine-class mIoU and 83.81% category-level mIoU, and ranks first on the final phase leaderboard: www.codabench.org/competitions/14257/#/results-tab.
comment: 5 pages, 4 figures
DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning ICML 2026
A natural recipe for intelligent robotic decision-making is initializing from pretrained generative control policies, which have summarized offline experience, and adapting them to self-collected online experience. We present DF-ExpEnse, an exploration technique that improves the quality of online experience collection, thus increasing finetuning sample-efficiency. DF-ExpEnse leverages the multimodal modeling capabilities of the generative control policy to create an expressive and tractably evaluatable candidate set. It then utilizes an ensemble of critics to identify the action that best balances quality with high exploration interest. In fleet settings, DF-ExpEnse further enables cross-agent communication to facilitate collaborative exploration as a group. DF-ExpEnse can be seamlessly integrated with existing strategies that finetune pretrained generative control policies via reinforcement learning. We experimentally validate consistent sample-efficiency benefits through DF-ExpEnse across a variety of manipulation and locomotion tasks, compared to default finetuning and alternative action selection schemes. Project can be found at https://df-expense.github.io.
comment: ICML 2026
Scaling Self-Play for End-to-End Driving
End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird's-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.
CTS-MoE: Implicit Terrain Adaptation via Mixture-of-Experts for Perceptive Locomotion
Perceptive legged locomotion over discontinuous terrain (e.g., stairs, gaps, and obstacles) requires adaptive behavior, as a single conservative gait cannot produce the anticipatory maneuvers needed for abrupt topology changes. Cast as multi-task reinforcement learning, this problem introduces a tension between sharing and separation. Tasks use a common locomotion base but have conflicting rewards, so a policy must share behavior while avoiding value interference. Prior work addresses only one side, with monolithic policies sacrificing specialization and hierarchical sub-policies sacrificing generalization across transitions and unseen terrain. We propose CTS-MoE, which combines a dense mixture-of-experts actor with perception-based gating to compose shared behaviors and a multi-critic with task-specific value heads to prevent interference. The model is trained end-to-end in a single-stage concurrent teacher-student setup that handles partial observability and avoids sequential distillation, with task labels used only during training. At deployment, routing depends solely on perception, allowing terrain adaptation without a high-level selector or terrain classifier. Experiments on a Unitree Go1 in simulation and on hardware across seen and unseen terrains show task-aware specialization, with lower tracking error and higher success rates than monolithic baselines. Project Website: https://cts-moe.github.io/ .
Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation IROS 2026
Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with <=0.6 percentage-point deviation (95% CI). Discrete VQ-VIB messages provide +11.6 to +13.6 percentage-point fidelity advantages over continuous methods, enabling 3-4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
comment: 9 pages, 3 figures, 7 tables. Accepted at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026), Pittsburgh, Pennsylvania, USA, September 27-October 1, 2026
Fail-RAG : A Retrieval Augmented Generation Informed Framework for Robot Failure Identification
Industry automation is witnessing an evolution in robotics driven by both technological breakthroughs and societal changes: progress towards generalist robots, embodied and physical artificial intelligence (AI), and increasing labor shortage in manufacturing.An intelligent autonomous robot needs to not only act according to planned motions but also react to any unexpected events. In this study, we focus on such unexpected events in warehouses where robots are used for material handling. Specifically, we refer to any unexpected events as failures and develop methods to detect robot operations related failures. Rule-based detection methods may break since the form of failures could change due to the dynamic nature of both environments and tasks. We propose 'Fail-RAG', a Retrieval Augmented Generation (RAG)-based failure detection framework where failure images and context information are embedded and queried against a failure database by calculating their similarities. Vision-Language Models (VLMs) are further used to analyze failures and provide details by following our instruction template. We evaluated the performance of Fail-RAG by conducting both simulation and physical experiments using fixed robot arms and a mobile manipulator for multiple tasks that are common in warehouse automation. Fail-RAG achieved 25 percentage point higher failure detection accuracy on average across five types of robot operations compared to using off-the-shelf VLMs, indicating its effectiveness for real-world failure detection.
Safe, Real-Time Active Model Discrimination and Fault Diagnosis for Nonlinear Systems via Differentiable Reachability
We present a safe, real-time algorithm for active fault diagnosis and model discrimination for uncertain continuous-time nonlinear systems with process and measurement disturbances. Given a finite set of candidate models representing nominal and faulty modes, including actuator and sensor faults, we formulate an output-feedback, time-varying policy optimization problem that (i) robustly enforces state-input safety constraints over a finite horizon and (ii) drives the system to produce sampled measurements consistent with at most one model, enabling deterministic diagnosis. To solve this problem in real time, we develop a tractable approximation using interval over-approximations of reachable state and output sets, and encode diagnosability via a differentiable objective that penalizes overlap between the reachable output sets of possible models. The resulting optimization is solved efficiently online with gradient-based methods using JAX and differentiable reachability primitives. We evaluate our method on sensor and actuator fault diagnosis (up to 11 fault modes) in several high-dimensional nonlinear robotic systems, including a simulated quadrotor and fighter-jet model, a hardware differential-drive robot, and quadrupedal navigation. Across these case studies, our approach achieves reliable model discrimination in under 50 ms, outperforming baselines in discrimination success rate and speed while providing formal safety guarantees.
One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies
Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot's initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gripper with a single fisheye camera. We introduce a novel Gaussian Splatting formulation, adapted to wide FoV fisheye cameras, to reconstruct and edit the 3D scene with unseen objects. We utilize trajectory optimization to generate smooth, collision-free, view-rendering-friendly action trajectories and render visual observations from corresponding novel views. Comprehensive experiments in simulation and the real world show that our augmentation framework improves the success rate for various manipulation tasks in both the same scene and the augmented scene with obstacles requiring collision avoidance.
comment: Project website: https://chuerpan.com/1001-demos.github.io/. Published at CoRL 2025
pdSTL: Probabilistic Differentiable Signal Temporal Logic for Stochastic Systems
Autonomous robots operating in uncertain environments must satisfy complex temporal and safety specifications despite stochastic dynamics and sensing noise. While Signal Temporal Logic (STL) offers robustness measures for gradient-based optimization, existing extensions either lack differentiability or ignore belief-space uncertainty. We introduce pdSTL (probabilistic differentiable Signal Temporal Logic), a framework that unifies probabilistic semantics with differentiable robustness over belief trajectories. pdSTL employs interval-valued probabilistic semantics to compute conservative satisfaction bounds, propagated compositionally through the STL syntax tree. We formulate the temporal robustness evaluation as a recurrent, LSTM-style unfolding of STL operators, enabling linear-time, differentiable monitoring suitable for end-to-end trajectory optimization. We validate pdSTL on simulated obstacle avoidance, lane-change maneuvers, and real-world Crazyflie quadcopter flight experiments under aerodynamic disturbances. Results demonstrate that pdSTL achieves efficient optimization with formal probabilistic guarantees, significantly outperforming deterministic differentiable STL in maintaining safety margins under real-world uncertainty.
SCAN-Planner: Spatial Collision-Aware Local Planning for Route-Guided Long-Range Quadruped Navigation
Quadruped robots are increasingly expected to navigate through narrow passages, cluttered indoor scenes, and large-scale 3D unstructured environments. Existing local planners commonly approximate the robot using isotropic geometric inflation or rely on planar and elevation-map representations, leading to conservative motion in tight spaces and limited reasoning about overhanging structures. This letter presents SCAN-Planner, a spatial collision-aware local planning framework for long-range quadruped navigation. A yaw-aware twin-cylinder footprint is used to model the elongated robot body, enabling whole-body collision evaluation through sparse queries in an inflated 3D occupancy map. We further introduce a projected A* search that generates collision-free guidance on an interpolated ground-following surface, with z-gradient suppression to avoid obstacles horizontally while maintaining vertical stability. For large-scale deployment, a robot-centric sliding map with boundary fallback provides high-resolution local collision checking and recovery from local dead ends. Simulation and real-world experiments demonstrate that SCAN-Planner generates safe, smooth, and efficient trajectories in dense clutter, 3D unstructured scenes, stair traversal, and long-range navigation tasks.
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Does world action model really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matching competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
comment: Project Page: https://zhangwenyao1.github.io/ImageWAM/
A Categorial and Sheaf-Theoretic Semantics for Autonomic Component Ensembles
The proliferation of large-scale, decentralized systems of autonomous agents, such as swarms of robots and networked cyber-physical systems, presents a formidable challenge to traditional formal methods. The Software Component Ensemble Language (SCEL) offers a formal model for such systems, but its operational semantics is not ideal for reasoning about global, structural, and emergent properties. This report proposes a new, multi-layered mathematical model for SCEL using category theory and sheaf theory. We argue that a society of robots described in SCEL can be formally modeled as a sheaf on a topological space, where components are points, ensembles are open sets, and distributed knowledge forms the sheaf's data. In this framework, computational processes like information sharing become equivalent to the sheaf-theoretic operation of "gluing" local data. System failures can then be understood and quantified as topological obstructions, measurable by sheaf cohomology. This approach transforms the verification of a complex distributed system into the analysis of the geometry of a mathematical object, providing deep, structural insights for the design of robust autonomic systems.
Proprioceptive Invariant State Estimation for Humanoid Robots on Non-Inertial Ground
This paper presents an invariant extended Kalman filtering (InEKF) approach for real-time state estimation of humanoid robots operating on non-inertial ground using only onboard proprioceptive sensing. The proposed approach estimates the robot's base position and velocity relative to the moving ground frame without requiring direct measurements of ground motion or externally mounted sensors. By exploiting kinematic constraints at the stance foot through foot-mounted IMUs, the filter accounts for ground-induced nonlinearities in the process and measurement models while remaining fully proprioceptive. The estimator is formulated to admit a right-invariant measurement model, enabling favorable error dynamics under large initial uncertainties. Observability analysis establishes conditions under which the robot's relative base position and velocity are observable with respect to the non-inertial ground frame. Experiments with the Digit humanoid robot standing and squatting atop a swaying and pitching ground showcase a 96% speedup in convergence rate and an 80% reduction in position estimate errors over existing InEKFs. Walking experiments on a uni-axially rotating ground achieve an average estimation error of less than 9 cm for an initial error of up to 1 m.
Simulating Robotic Locomotion in Sand: Resistive Force Theory in an Open-Source Physics Engine
Recent advancements in Resistive Force Theory (RFT) enable approximation of ground reaction forces for locomotion in sand without the computational expense of modeling interactions with individual grains. However, these tools have been absent in 3D physics engines commonly used for robot simulation. We explore if resistive force approximations are sufficient, when integrated with standard dynamics calculations, to provide a stable substrate for a freely walking robot. To determine this, we implement 3D Granular Resistive Force Theory (3D RFT) in a physics simulation engine, MuJoCo. We verify simulations in multiple scenarios to demonstrate that key trends due to end effector shape, speed, and loading are preserved. Our implementation predicts walking distance and foot sinkage of a 12-Degree of Freedom hexapod robot within 20\% of experiments in sand. While RFT has inherent approximations, the open source tool described here has potential to help develop new and improved robot designs to traverse granular media substrates.
comment: 12 pages, 7 figures
3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning ICML 2026
We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.
comment: ICML 2026. Project webpage: https://eubooks3003.github.io/3d-dlp
Playful Agentic Robot Learning
Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.
comment: Project page: https://playful-rats.github.io/
FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning
Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions are an architecture-free, drop-in upgrade to the fixed-capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.
DiffusionVS: A Generative Framework for Robust Visual Servoing Based on Diffusion Policy
Visual servoing is a fundamental technique in robotic manipulation and navigation. Regression-based visual servoing frequently experiences trajectory jitter as a result of noise-sensitive single-step mappings and the accumulation of errors during distribution shifts. In contrast, Diffusion Policy maintains temporal consistency by predicting action sequences and improves robustness through implicit data augmentation. This paper presents a novel diffusion-based servoing method. Based on Diffusion Policy, the proposed approach uses normalized image coordinates of observed tag corners as input and generates camera velocity through conditional denoising. To overcome the generalization limitations of models trained on static datasets, an online training paradigm is adopted, continuously expanding the diversity of training data through interactive experience collection. This strategy substantially enhances both the performance and generalization capability of the model. Comprehensive simulations and real-world experiments demonstrate the effectiveness of the proposed method, achieving success rates of nearly 100\% in simulation and 93\% in physical experiments. Beyond the specific pipeline, we further validate the generality of the diffusion mechanism. Experiments show that existing visual servoing networks consistently achieve improved performance when integrated with our diffusion-based module. These results indicate that the proposed strategy possesses broad applicability and can enhance various visual servoing systems beyond the specific architecture presented here.
comment: 8 pages, 4 figures, 7 tables
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $π$0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.
comment: 44 pages
Why Automate This? Exploring Correlations Between Desire for Robotic Automation, Invested Time and Well-Being
Understanding the motivations underlying the human inclination to automate tasks is vital for developing robots that fit seamlessly into daily life. Accordingly, we ask: are individuals more inclined to automate activities based on the time they consume or the feelings experienced while performing them? This study explores these preferences and whether they vary across social groups, specifically gender category and income level. Leveraging data from the BEHAVIOR-1K dataset, the American Time-Use Survey, and the American Time-Use Survey Well-Being Module, we investigate the relationship between the desire for robot automation, time spent, and associated feelings: Happiness, Meaningfulness, Sadness, Painfulness, Stressfulness, or Tiredness. Our key findings show that, despite common assumptions, time spent on activities does not strongly predict automation preferences; instead, happiness and pain are the strongest indicators. We also identify differences by gender and economic level: Women prefer to automate stressful activities, whereas men prefer to automate those that make them unhappy; mid-income individuals prioritize automating less enjoyable and meaningful activities, while low and high-income show no significant correlations. We hope our research helps motivate the design of robots that align with user priorities, moving domestic robotics toward more socially relevant solutions. All data and an interactive tool are publicly available at https://robin-lab.cs.utexas.edu/why-automate-this/.
comment: 26 pages, 14 figures
Enhancing Fatigue Detection through Heterogeneous Multi-Source Data Integration and Cross-Domain Modality Imputation
Fatigue detection for human operators is important in safety-related applications such as aviation, mining, and long-haul transport. Reliable estimation of operator fatigue can support timely warnings, adaptive task scheduling, takeover reminders, and other safety-management decisions in human-machine systems. However, the effectiveness of these functions depends on whether fatigue-related signals can be reliably captured in the deployment environment. While many studies have shown the value of high-fidelity sensors in controlled laboratory environments, their performance often degrades when used in real-world settings because of noise, lighting conditions, and field-of-view constraints, thereby limiting their practical use. This paper formalizes a deployment-oriented setting for real-world fatigue detection, where high-quality sensors are often unavailable in practical applications. To address this issue, we use knowledge from heterogeneous source domains, including high-fidelity sensors that are difficult to deploy in the field but commonly used in controlled environments, to assist fatigue detection in the real-world target domain. Based on this idea, we design a heterogeneous and multi-source fatigue-detection framework that uses the available modalities in the target domain while leveraging diverse configurations in the source domains through cross-domain modality imputation based on shared modalities.
comment: 4figures,14pages
Steering Flexible Linear Objects in Planar Environments by Two Robot Hands Using Euler's Elastica Solutions
The manipulation of flexible objects such as cables, wires and fresh food items by robot hands forms a special challenge in robot grasp mechanics. This paper considers the steering of flexible linear objects in planar environments by two robot hands. The flexible linear object, modeled as an elastic non-stretchable rod, is manipulated by varying the gripping endpoint positions while keeping equal endpoint tangents. The flexible linear object shape has a closed form solution in terms of the grasp endpoint positions and tangents, called Euler's elastica. This paper obtains the elastica solutions under the optimal control framework, then uses the elastica solutions to obtain closed-form criteria for non self-intersection, stability and obstacle avoidance of the flexible linear object. The new tools are incorporated into a planning scheme for steering flexible linear objects in planar environments populated by sparsely spaced obstacles. The scheme is fully implemented and demonstrated with detailed examples.
Robust and Efficient MuJoCo-based Model Predictive Control via Web of Affine Spaces Derivatives IROS 2026
MuJoCo is a powerful and efficient physics simulator widely used in robotics. One common way it is applied in practice is through Model Predictive Control (MPC), which uses repeated rollouts of the simulator to optimize future actions and generate responsive control policies in real time. To make this process more accessible, the open source library MuJoCo MPC (MJPC) provides ready-to-use MPC algorithms and implementations built directly on top of the MuJoCo simulator. However, MJPC relies on finite differencing (FD) to compute derivatives through the underlying MuJoCo simulator, which is often a key bottleneck that can make it prohibitively costly for time-sensitive tasks, especially in high-DOF systems or complex scenes. In this paper, we introduce the use of Web of Affine Spaces (WASP) derivatives within MJPC as a drop-in replacement for FD. WASP is a recently developed approach for efficiently computing sequences of accurate derivative approximations. By reusing information from prior, related derivative calculations, WASP accelerates and stabilizes the computation of new derivatives, making it especially well suited for MPC's iterative, fine-grained updates over time. We evaluate WASP across a diverse suite of MJPC tasks spanning multiple robot embodiments. Our results suggest that WASP derivatives are particularly effective in MJPC: it integrates seamlessly across tasks, delivers consistently robust performance, and achieves up to a 2$\mathsf{x}$ speedup compared to an FD backend when used with derivative-based planners, such as iLQG. In addition, WASP-based MPC outperforms MJPC's stochastic sampling-based planners on our evaluation tasks, offering both greater efficiency and reliability. To support adoption and future research, we release an open-source implementation of MJPC with WASP derivatives fully integrated.
comment: Accepted to 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
RSLCPP -- Deterministic Simulations Using ROS 2
Simulation is crucial in real-world robotics, offering safe, scalable, and efficient environments for developing a variety of robotic applications. While the Robot Operating System (ROS) has been widely adopted as the backbone of these robotic applications in both academia and industry, its asynchronous, multi-process design complicates reproducibility, especially across varying hardware platforms. Deterministic callback execution cannot be guaranteed when computation times and communication delays vary. This lack of reproducibility complicates scientific benchmarking and continuous integration, where consistent results are essential. To address this, we present a methodology to create deterministic simulations using ROS 2 nodes. Our ROS Simulation Library for C++ (RSLCPP) implements this approach, enabling existing nodes to be combined into a simulation routine that yields reproducible results, usually without requiring any source code changes. We demonstrate that our approach produces identical results across various CPUs and architectures when testing both a synthetic benchmark and a real-world robotics system. RSLCPP is open-sourced at https://github.com/TUMFTM/rslcpp.
comment: Accepted for publication at the 'IEEE Robotics and Automation Practice'
Odyssey: An Automotive Lidar-Inertial Odometry Dataset with GNSS-denied situations
The development and evaluation of Lidar-Inertial Odometry (LIO) and Simultaneous Localization and Mapping (SLAM) systems requires a precise ground truth. The Global Navigation Satellite System (GNSS) is often used as a foundation for this, but its signals can be unreliable in obstructed environments due to multi-path effects or loss-of-signal. While existing datasets compensate for sporadic GNSS loss by incorporating Inertial Measurement Unit (IMU) measurements, the commonly used systems do not permit prolonged study of GNSS-denied environments due to accumulated drift. Therefore, the diversity of such datasets is limited. To close this gap, we present Odyssey, an automotive LIO dataset featuring: (1) a ground truth derived from a navigation-grade Ring Laser Gyroscope (RLG)-based RTK/INS, offering bias stability one to four orders of magnitude better than existing automotive datasets; (2) a comprehensive collection of 36 sequences across diverse environments, enabling robust and comprehensive evaluation and (3) prolonged GNSS-denied environments, including tunnels and, previously unseen in the context of automotive benchmarks, indoor parking garages. Here, our RLG-based system enables accurate evaluation in scenarios where commonly employed systems would drift excessively. Besides providing data for LIO, Odyssey also supports place recognition tasks through threefold trajectory repetition and integration of external mapping data via precise geodetic coordinates. All data, dataloader and supplementary material are available online at https://odyssey.uni-goettingen.de/ .
comment: 10 pages, 4 figures, 3 tables, submitted to International Journal of Robotics Research (IJRR)
Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition IROS
Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.
comment: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026
STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation
Visual foundation models provide strong perceptual features for robotics, but their dense representations lack explicit object-level structure, limiting robustness and contractility in manipulation tasks. We propose STORM (Slot-based Task-aware Object-centric Representation for robotic Manipulation), a lightweight object-centric adaptation module that augments frozen visual foundation models with a small set of semantic-aware slots for robotic manipulation. Rather than retraining large backbones, STORM employs a multi-phase training strategy: object-centric slots are first stabilized through visual--semantic pretraining using language embeddings, then jointly adapted with a downstream manipulation policy. This staged learning prevents degenerate slot formation and preserves semantic consistency while aligning perception with task objectives. Experiments on object discovery benchmarks and simulated manipulation tasks show that STORM improves generalization to visual distractors, and control performance compared to directly using frozen foundation model features or training object-centric representations end-to-end. Our results highlight multi-phase adaptation as an efficient mechanism for transforming generic foundation model features into task-aware object-centric representations for robotic control.
ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI
Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available https://huggingface.co/datasets/huggingdas/erqa-plus and the project page at https://github.com/LUNAProject22/erqa-plus.
Tilt-Ropter: A Fully Actuated Hybrid Aerial-Terrestrial Vehicle with Tilt Rotors and Passive Wheels IROS 2026
In this work, we present Tilt-Ropter, a fully actuated hybrid aerial-terrestrial vehicle (HATV) that integrates tilt rotors with passive wheels to enable efficient multi-modal locomotion. Unlike conventional underactuated HATVs, the fully actuated design of Tilt-Ropter allows decoupled force and torque control, improving maneuverability and ground locomotion efficiency. A unified nonlinear model predictive controller (NMPC) is developed to track reference trajectories, enforce non-holonomic constraints, and accommodate contact effects across locomotion modes, while ensuring actuator feasibility through dedicated control allocation. To address complex wheel-ground dynamics, an external wrench estimator is incorporated to provide real-time interaction wrench estimates. The system is validated through simulation and real-world experiments, including seamless air-ground transitions and trajectory tracking tasks. Experimental results demonstrate low tracking errors in both modes and reveal a 92.8% reduction in power consumption during ground locomotion compared to flight, highlighting the platform's suitability for long-duration missions in energy-constrained environments.
comment: 8 pages, 10 figures. Accepted by the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
TurboMap: GPU-Accelerated Local Mapping for Visual SLAM IROS 2026
In real-time Visual SLAM systems, local mapping must operate under strict latency constraints, as delays degrade map quality and increase the risk of tracking failure. GPU parallelization offers a promising way to reduce latency. However, parallelizing local mapping is challenging due to synchronized shared-state updates and the overhead of transferring large map data structures to the GPU. This paper presents TurboMap, a GPU-parallelized and CPU-optimized local mapping backend that holistically addresses these challenges. We restructure Map Point Creation to enable parallel Keypoint Correspondence Search on the GPU, redesign and parallelize Map Point Fusion, optimize Redundant Keyframe Culling on the CPU, and integrate a fast GPU-based Local Bundle Adjustment solver. To minimize data transfer and synchronization costs, we introduce persistent GPU-resident keyframe storage. Experiments on the EuRoC and TUM-VI datasets show average local mapping speedups of 1.3x and 1.6x, respectively, while preserving accuracy.
comment: Accepted for presentation at IROS 2026, preprint
DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions
Learning dexterous manipulation from human-object interaction (HOI) data offers a scalable alternative to robot teleoperation, but HOI demonstrations are typically sparse and purely kinematic, making direct retargeting unreliable under embodiment mismatch and contact-rich dynamics. We present DexSynRefine, a coupled framework that treats HOI data as structured motion priors rather than executable robot actions. DexSynRefine first synthesizes hand-object trajectories conditioned on the task and initial object state using HOI Motion Manifold Flow Primitives (HOI-MMFP), a motion prior for coupled hand-object motion. It then physically grounds them with task-space residual reinforcement learning and adapts execution by inferring missing contact-dynamics context from proprioceptive history. Across five dexterous manipulation tasks, each stage addresses a complementary bottleneck: HOI-MMFP improves trajectory consistency and smoothness, task-space residuals provide the strongest grounding representation among the tested alternatives, and contact-dynamics adaptation enables robust real-world execution. Together, DexSynRefine improves real-world success rates over kinematic retargeting by 50-70~percentage points.
comment: Project page: https://dexsynrefine.github.io/
Mutual Adaptation in Human-Robot Co-Transportation with Human Preference Uncertainty
Mutual adaptation can enhance overall task performance in human-robot co-transportation by integrating both the robot's and the human's understanding of the environment. While human modeling helps capture humans' subjective preferences, two challenges persist: (i) the uncertainty of human preference parameters and (ii) the need to balance adaptation strategies that benefit both humans and robots. In this paper, we propose a unified framework to address these challenges and improve task performance through mutual adaptation. First, instead of relying on fixed parameters, we model a probability distribution of human choices by incorporating a range of uncertain human preference parameters. Building on this, we introduce a time-varying stubbornness measure and a coordinated planning model, which allows either the robot to lead the team's trajectory or, if a human's preferred path conflicts with the robot's plan and their stubbornness exceeds a threshold, the robot to transition to following the human. Finally, we introduce a pose optimization strategy for low-level control to mitigate the uncertain human behaviors when they are leading. To validate the framework, we design and perform a study with human feedback from twenty human participants. We then demonstrate, through simulations, the effectiveness of our models in enhancing task performance with mutual adaptation and pose optimization.
comment: 9 pages, 6 figures
DADP: Domain Adaptive Diffusion Policy
Learning domain adaptive policies that can generalize to unseen transition dynamics, remains a fundamental challenge in learning-based control. Substantial progress has been made through domain representation learning to capture domain-specific information, thus enabling domain-aware decision making. We analyze the process of learning domain representations through dynamical prediction and find that selecting contexts adjacent to the current step causes the learned representations to entangle static domain information with varying dynamical properties. Such mixture can confuse the conditioned policy, thereby constraining zero-shot adaptation. To tackle the challenge, we propose DADP (Domain Adaptive Diffusion Policy), which achieves robust adaptation through unsupervised disentanglement and domain-aware diffusion injection. First, we introduce Lagged Context Dynamical Prediction, a strategy that conditions future state estimation on a historical offset context; by increasing this temporal gap, we unsupervisedly disentangle static domain representations by filtering out transient properties. Second, we integrate the learned domain representations directly into the generative process by biasing the prior distribution and reformulating the diffusion target. Extensive experiments on challenging benchmarks across locomotion and manipulation demonstrate the superior performance, and the generalizability of DADP over prior methods. More visualization results are available on the https://outsider86.github.io/DomainAdaptiveDiffusionPolicy/.
On Feedback Speed Control for a Planar Tracking
This paper investigates a planar tracking problem between a leader and follower agent. We propose a novel feedback speed control law, paired with a constant bearing steering strategy, to maintain an abreast formation between the two agents. We prove that the proposed control yields asymptotic stability of the closed-loop system when the steering of the leader is known. For the case when the leader's steering is unavailable to the follower, we show that the system is still input-to-state stable with respect to the leader's steering viewed as an input. Furthermore, we demonstrate that if the leader's steering is periodic, the follower will asymptotically converge to a periodic orbit with the same period. We validate these results through numerical simulations and experimental implementations on mobile robots. Finally, we demonstrate the scalability of the proposed approach by extending the two-agent control law to an N-agent chain network, illustrating its implications for directional information propagation in biological and engineered flocks.
Periodic robust robotic rock chop via virtual model control
Robotic cutting is a challenging, contact-rich manipulation task where the robot must simultaneously negotiate unknown object mechanics, large contact forces, and precise motion requirements. Our hypothesis is that this complexity can be alleviated through the design of a physically structured virtual-model controller that uses switched virtual mechanisms to generate a robust, rhythmic rock-chop motion for robotic cutting, without requiring pre-planned trajectories or precise environmental information. Motion is generated by the interaction between the environment, the robot's dynamics, and the virtual forces of the switching virtual mechanism, ultimately realized through the available actuation. Through theoretical analysis and experimental validation, we demonstrate that the controlled robot behavior settles into a stable periodic motion. Experiments with a Franka manipulator demonstrate robust cuts across five different vegetables, achieving sub-millimeter slice accuracy for thicknesses from 1 mm to 6 mm at a rate of nearly one cut per second. The controller maintains high performance despite changes in knife shape or cutting board height, and successfully adapts to a different humanoid manipulator, demonstrating robustness and platform independence.
Movement Primitives in Robotics: A Comprehensive Survey
Biological systems exhibit a continuous stream of movements, consisting of sequential segments, that allow them to perform complex tasks in a creative and versatile fashion. This observation has led researchers towards identifying elementary building blocks of motion known as movement primitives, which are well-suited for generating motor commands in autonomous systems, such as robots. In this survey, we provide an encyclopedic overview of movement primitive approaches and applications in chronological order. Concretely, we present movement primitive frameworks as a way of representing robotic control trajectories acquired through human demonstrations. Within the area of robotics, movement primitives can encode basic motions at the trajectory level, such as how a robot would grasp a cup or the sequence of motions necessary to toss a ball. Furthermore, movement primitives have been developed with the desirable analytical properties of a spring-damper system, probabilistic coupling of multiple demonstrations, using neural networks in high-dimensional systems, and more, to address difficult challenges in robotics. Although movement primitives have widespread application to a variety of fields, the goal of this survey is to inform practitioners on the use of these frameworks in the context of robotics. Specifically, we aim to (i) present a systematic review of major movement primitive frameworks and examine their strengths and weaknesses; (ii) highlight applications that have successfully made use of movement primitives; and (iii) examine open questions and discuss practical challenges when applying movement primitives in robotics.
comment: 105 pages, 3 figures, and 6 tables
Learn from What We HAVE: History-Aware VErifier that Reasons about Past Interactions Online
We introduce a novel History-Aware VErifier (HAVE) to disambiguate uncertain scenarios online by leveraging past interactions. Robots frequently encounter visually ambiguous objects whose manipulation outcomes remain uncertain until physically interacted with. While generative models alone could theoretically adapt to such ambiguity, in practice they obtain suboptimal performance in ambiguous cases, even when conditioned on action history. To address this, we propose explicitly decoupling action generation from verification: we use an unconditional diffusion-based generator to propose multiple candidate actions and employ our history-aware verifier to select the most promising action by reasoning about past interactions. Through theoretical analysis, we demonstrate that employing a verifier significantly improves expected action quality. Empirical evaluations and analysis across multiple simulated and real-world environments including articulated objects, multi-modal doors, and uneven object pick-up confirm the effectiveness of our method and improvements over baselines. Our project website is available at: https://liy1shu.github.io/HAVE_CoRL25/
comment: CoRL 2025
Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands
Dexterous grasping depends on contact regulation, not motion alone. Stable manipulation requires fingers to maintain appropriate object loading as contacts slip, deform, or become visually occluded. Existing cross-embodiment dexterous policies unify motion through retargeted hand poses or latent actions, but force feedback remains tied to each hand's sensing and actuation, limiting transfer. This work introduces a cross-embodiment force-position interface for contact-aware manipulation across heterogeneous dexterous hands. Motion intent is represented in a shared hand-pose latent, while each hand's effort signal is calibrated through system identification into physical joint torque in N.m. These torques are mapped to fingertip forces and compact per-finger load descriptors, giving the policy comparable observations of where the hand should move and how the object is loaded. Using this interface, a flow-matching visuomotor policy is trained on vision, proprioception, and calibrated contact, with structured visual masking that encourages reliance on force under grasp-relevant occlusion. The same calibrated signal drives a hybrid force-position controller for demonstration collection and execution, keeping force targets consistent across training and deployment. Experiments across structurally different hands show that calibrated contact feedback enables transferable compliant grasping, with learned primitives reusable in long-horizon manipulation pipelines.
comment: Website(overview): transferring-contact-not-just-motion.github.io/
Critique of World Model
World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of ``hypothetical thinking'' in psychology literature, we argue the primary goal of a world model to be {\it simulating all actionable possibilities of the real world for purposeful reasoning and acting}. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.
Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation
From loco-motion to dextrous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are a few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dextrous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks across 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.
Self-Supervised Relevance Modelling in Autonomous Driving via Counterfactual Analysis
Autonomous driving relies on computationally intensive perception pipelines to continuously detect and track objects in the surrounding environment. While some objects are key to plan safe and effective maneuvers, others may not be relevant and have no impact on the autonomous vehicle's driving decisions. Focusing on relevant objects allows a more efficient usage of available computational resources, reduces processing latencies, and limits the downstream propagation of perception noise. In this work, we propose a novel self-supervised approach based on counterfactual analysis to develop a relevance model - an AI-based tool that quantifies the relevance of objects for an autonomous vehicle. To demonstrate the potential of the proposed approach, we train a relevance model on a synthetic causal dataset generated in a selected urban scenario. Results show that the relevance model is able to accurately estimate the objects' relevance with millisecond-level latency, enabling real-time relevance estimation also in high-density scenarios. We also show that the relevance model can be used to build relevance heatmaps that offer valuable insights into the autonomous vehicle's driving policy and can be used to proactively inform perception and planning tasks. We openly release both the relevance model and the causal dataset.
Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning
Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50 % compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl
comment: 12 pages (+4 supplementary). Website: https://rpg.ifi.uzh.ch/marl
Multiagent Systems
Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents
Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts who must collaboratively discover, structure, and query enterprise data. We present Data Intelligence Agents (DIA), a system of three agents (Data Interpreter, Schema Creator, and Query Generator) that compresses this workflow by treating autonomous coding agents (ACAs) as a first-class abstraction: rather than emitting text, the agents generate, execute, validate, and repair concrete artifacts, draw on a shared memory for experience reuse, and surface each for review by domain experts. DIA is deployed in production for enterprise customers. We study the Query Generator in depth and evaluate it in fully autonomous mode across seven SQL benchmarks spanning four task categories and four dialects. It matches or surpasses the best published results on all seven, demonstrating that an architecture grounded in execution, built on ACAs and a shared memory, generalizes across the data intelligence workload with adaptation confined to natural-language instructions.
Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play
Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent's decision by best responding to the empirical mixture of other agents' past decisions. This enables agents to expose and address one another's weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.
comment: 18 pages, 8 figures
Digital Speech Acts Retain Control of Copyright with People, Not Platforms
Legal precedents protect computer code as copyrightable expression. They have enabled centralized digital platforms -- operating from corporate servers that hold all user data -- to construct private governance regimes through the interaction of copyright, contract, and technical architecture: people who create virtually all platform value must surrender effective copyright control through Terms of Service agreements as a condition of participation. In contrast, grassroots platforms consist of cryptographically-identified people operating their networked smartphones independently of any server or global resource; each person holds their own data on their own device, with no third party in possession or intermediation. Here, we define the notion of a \textit{digital speech act} -- a deliberate volitional act by a person of cryptographically signing personal content with the person's private key, carried out on the person's own device -- through which the person simultaneously establishes attribution, accountability, and authorship over the signed content. We contend that (\ia) digital speech acts qualify for copyright protection under existing U.S.\ precedent: \textit{Burrow-Giles} locates authorship in volitional creative choices despite mechanical or algorithmic processes, \textit{Feist} supplies the minimal-creativity threshold, and persistent device storage satisfies the Copyright Act's fixation requirement; (\ib) the digital social contract underlying grassroots platforms preserves this copyright by design -- signed content cannot be unbundled from its signature, and the full provenance chain accumulates as content is forwarded -- so that ownership and possession coalesce in the person; and (\ic) copyright in digital speech acts is a prerequisite for digital sovereignty and democratic self-governance.
A Technical Taxonomy of LLM Agent Communication Protocols
As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy's purpose, meta-characteristic, and ending conditions, then performed five iterations, three empirical-to-conceptual and two conceptual-to-empirical, on nine actively maintained open-source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. Analysis suggests short-term convergence pressure toward protocols unifying agent-to-agent and agent-to-context (tool and data) communication. Long-term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously. The field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.}
Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams
Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.
comment: 33 pages
Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents
Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.
comment: 15 pages, Figure 8
Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems
Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.
EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems
In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves modularity, scalability, and cost efficiency, its reliability depends not only on accurate routing but also on sub-agents' ability to calibrate their responses to capability constraints. In particular, sub-agents built on smaller fine-tuned models often struggle with such calibration, leading them to over-answer ambiguous, underspecified, misrouted, or unsupported requests and produce hallucinated outputs instead of actionable feedback. To address this challenge, we present EARS (Explanatory Abstention for Reliable Sub-Agent Modeling), a production-oriented framework that reframes sub-agent abstention as an inter-agent communication protocol: a sub-agent does not merely abstain, but exposes an actionable failure state to the coordinator. EARS curates human-agent interaction data using an ensemble of calibrated LLM-as-a-Judge models, producing structured abstention labels and rationales under a taxonomy of sub-agent failure modes. These data are used to fine-tune sub-agents to detect failure conditions and return rationales for coordinator-level clarification, rerouting, or fallback. We evaluate EARS in a large-scale production e-commerce assistant supporting enterprise business intelligence workflows. EARS improves the overall response pass rate from 68.5% to 78.9%, demonstrating that sub-agent-side explanatory abstention improves MAS reliability.
Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies
Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.
PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning
Effective programming education requires personalized instruction adapted to diverse learner backgrounds. However, while LLM-based multi-agent systems (MAS) excel at complex planning, existing planners often lack profile-grounding and pedagogical scaffolding, thereby undermining personalized programming learning. To fill in the gap, we first introduce \textbf{MAP-PPL} (\textbf{M}ulti-\textbf{A}gent \textbf{P}lans for \textbf{P}ersonalized \textbf{P}rogramming \textbf{L}earning), a profile-conditioned multi-agent planning dataset with 3{,}043 query--profile--plan instances from 1{,}730 Stack Overflow question groups and 2{,}738 learner profiles. Each plan specifies agents, subtasks, executable steps, and prerequisite dependencies. Then, we propose \textbf{PersonalPlan}, a two-stage MAS planner that first performs hierarchical SFT with separate LoRA adapters for profile-aware task decomposition and step dependency planning, then applies a Reward-Adaptive GRPO to encourage the model to generate executable, personalized, and pedagogically scaffolded plans. Extensive experiments on MAP-PPL comparing PersonalPlan against frontier LLMs, generic MAS frameworks, and agentic planners demonstrate its superiority. With only 8B and 32B variants, PersonalPlan achieves state-of-the-art plan executability, personalization, and pedagogical quality, effectively orchestrating MAS for agent-student interactions.
Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation IROS 2026
Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with <=0.6 percentage-point deviation (95% CI). Discrete VQ-VIB messages provide +11.6 to +13.6 percentage-point fidelity advantages over continuous methods, enabling 3-4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
comment: 9 pages, 3 figures, 7 tables. Accepted at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026), Pittsburgh, Pennsylvania, USA, September 27-October 1, 2026
Before the Pull Request: Mining Multi-Agent Coordination
Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.
comment: 9 pages, 2 tables. LNCS format. Code, dataset, and mining toolkit: https://github.com/neul-labs/grite
Mesh Inference: A Formal Model of Collective Intelligence Without a Center
We present a formal model of mesh inference: how a population of independent agents, each holding private state and exchanging only admitted, typed observations, derives a conclusion none of them holds alone, with no central coordinator and no agent exposed. No agent shares weights, gradients, or hidden state, and the agents may span different teams, networks, and organizations. Motivated by the observation that asking a model is energy-minimizing inference, we model the mesh as a coupled free energy that each agent relaxes locally. We show that a single admission/emission policy governs three properties. First, mesh inference converges to a unique answer for any admission, symmetric or not, because the coupling is always an M-matrix. Second, it is identification-complete: it derives the centralized optimum exactly when the contributing views are carrier-connected. Third, it is observation-only: no node transmits its internals, and confidentiality is the dual of identification. Content-addressed lineage is the only global side-channel. In the linear-Gaussian regime every derived answer is determined, hence equal to the centralized optimum, at O(diam^2) latency, the measured price of removing the center. One such derivation is one turn of a center-free learning loop, which we formalize as architecture rather than prove. The open problem we state is when asking improves the collective rather than corrupting it: whether the non-linear closure derives an upgraded answer or a confident error. To our knowledge, this is the first formal model of mesh inference.
comment: 21 pages, 2 figures
Deontic Policies for Runtime Governance of Agentic AI Systems
Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent that can invoke tools, manipulate data, install software, and coordinate with peer agents across organizational boundaries must be constrained not just by authentication and access control, but by the full structure of enterprise governance. This includes specifying what agents are permitted and prohibited from doing, what they areobliged to do after certain actions (e.g., notify the CISO), under what conditions a standing obligation may be waived, and which rules take precedence when policies conflict. This governance problem exceeds what current policy engines provide. Systems such as XACML, Rego, and Cedar address only the permit/prohibit subset of this governance structure. They do not provide obligation lifecycle management, meta-policy conflict resolution, dispensations that waive obligations in specific circumstances, and ontological reasoning over domain class hierarchies commonly found in applications such as healthcare, cybersecurity, or data privacy. We propose AgenticRei, which realizes key governance requirements such as obligations, dispensations, policy conflict resolutions, and reasoning over policies, as well as the basic permit/prohibit constraints. We use a deontic policy language built on the Rei framework, expressed as OWL (Web Ontology Language) and evaluated at runtime by a high-performance logic engine entirely outside the LLM. The same pipeline governs both tool invocations by the agent and agent-to-agent messages. We show through examples that deontic policies capture governance constraints around security and privacy that mostly cannot be expressed in current production engines. Our approach composes naturally with industry-standard frameworks like A2AS.
comment: 10 pages, 1 figure. To be published in the 2026 IEEE Symposium on Agentic Services which is part of the IEEE Conference on Web Services
From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability
Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another but assume address-based transport. Whether over HTTP(S) or a content-protecting binding such as MLS-based SLIM, these transports protect message content yet leave the communication graph exposed: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are capability-labeled, workflows are structured and chained, and interactions are coupled to actions, so an observer recovers more than past relationships: it can recognize a recurring pending workflow from its opening and, at machine speed, act on it before it completes. The threat is one of workflow integrity, not privacy alone. We give a threat model for the communication graph and locate what makes its metadata distinctively consequential: not stronger fingerprinting but exposure across independent trust domains, coupled to autonomous action. We define transport- and bootstrap-layer privacy properties, give them an indistinguishability-game semantics, evaluate transports, and give an A2A case study where a metadata-protecting binding surfaces its implicit identity assumptions. On a corpus of real multi-agent A2A traffic from the official reference agents, on a live A2A binding, and with a generative model as a controlled instrument, a label-blind classifier recovers a task's class from passive metadata at 6x chance, and from only its opening; a defense-aware adversary does not overturn this, and only the full set of properties drives recovery toward chance. Acting on the leak is distinct from recoverability: under a fixed budget an adversary captures 0.63 of a clairvoyant attacker's advantage on the corpus (0.41 from a workflow's opening), governed by top-ranked precision rather than overall accuracy, so integrity and privacy come apart under defense.
comment: 22 pages, 7 figures, 6 tables
Agents Trusting Agents? Restoring Lost Capabilities with Inclusive Healthcare
Agent-based simulations have an untapped potential to inform social policies on urgent human development challenges in a non-invasive way, before these are implemented in real-world populations. This paper responds to the request from non-profit and governmental organizations to evaluate policies under discussion to improve equity in health care services for people experiencing homelessness (PEH) in the city of Barcelona. With this goal, we integrate the conceptual framework of the capability approach (CA), which is explicitly designed to promote and assess human well-being, to model and evaluate the behaviour of agents who represent PEH and social workers. We define a reinforcement learning environment where agents aim to restore their central human capabilities, under existing environmental and legal constraints. We use Bayesian inverse reinforcement learning (IRL) to calibrate profile-dependent behavioural parameters in PEH agents, modeling the degree of trust and engagement with social workers, which is reportedly a key element for the success of the policies in scope. Our results open a path to mitigate health inequity by building relationships of trust between social service workers and PEH.
Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer? ICML 2026
The effectiveness of multi-agent LLM deliberation depends not only on the agents' individual predictions, but also on how they communicate and collaborate. We study this mechanism through the lens of Friedkin-Johnsen (FJ) opinion dynamics, a tractable model for analyzing stubbornness, influence, and opinion change in multi-agent systems that captures empirically observed deliberation patterns. We show that the FJ parameters are input-dependent, turning multi-agent deliberation into a mixture of experts. This perspective implies that multi-agent systems can outperform single agents and static ensembles when routing reflects agent competence. Since competence is latent in practice, we analyze how influence is established through observable proxies: agents' self-assessed confidence, their perceived confidence, and initial alignment with other agents' views.
comment: Accepted at the 2nd Workshop on Compositional Learning at ICML 2026
The Dynamics of Policy Gradient in Social Dilemmas with Partner Selection
In social dilemmas self-interested learning agents face the choice between the societal benefit of cooperation and the immediate reward of defection. Significant evidence exists on the benefits of assortment mechanisms such as partner selection for the emergence of cooperation, but this is largely available through agent-based simulations. In this paper, we provide an analytical solution to the problem, studying the policy-gradient dynamics in a multi-agent environment with partner selection. We show how partner selection changes the opponent distribution and hence the reward landscape, and prove this promotes cooperation under simple rules known from the literature. In particular, we find that population variance is a necessary condition for cooperation to emerge. Using a two-dimensional Wiener process, we extend the dynamics to capture the stochastic effects of partner selection and the resulting opponent distribution. We derive a sufficient condition for the population to be cooperation-promoting and prove the existence of a stationary distribution. Simulations confirm that the stochastic model accurately captures the policy-gradient dynamics and clarifies how the learning rate affects the emergence of cooperation.
TWICE: Modeling the Temporal Evolution of Personalized User Behavior via Event-Driven Agents
User simulators are widely used for data generation, evaluation, and agent-based interaction, but existing approaches often model users as static personas or rely on generic historical context, making it difficult to capture how individual behavior evolves over time. To address this limitation, we propose TWICE, an LLM-based framework for temporally grounded personalized user simulation. TWICE combines structured user profiling, an event-driven memory module organized around life events and behavioral shifts, and a two-stage workflow separating event-grounded content planning from personalized style adaptation. This design enables the simulator to model not only what a user says, but also how past experiences shape later expression. We evaluate TWICE on a large-scale longitudinal Twitter dataset and introduce a comprehensive evaluation framework that jointly measures authenticity, consistency, and humanlikeness. Results show that TWICE consistently outperforms strong baselines, suggesting that event-centered memory is a promising mechanism for modeling the temporal evolution of personalized user behavior.
ChatModel: Automating Reference Model Design and Verification with LLMs
As the complexity of integrated circuit designs continues to escalate, functional verification becomes increasingly challenging. Reference models, critical for accelerating the verification process, are themselves becoming more intricate and time-consuming to develop. Despite the promise shown by large language models (LLMs) in code programming, effectively generating complex reference models remains a significant hurdle. Therefore, we introduce ChatModel, an LLM-aided agile reference model generation and verification platform. ChatModel streamlines the transition from design specifications to fully functional reference models by integrating design standardization and hierarchical agile modeling. Employing a building-block generation strategy, it not only enhances the design capabilities of LLMs for reference models but also significantly boosts verification efficiency. We evaluated ChatModel on 300 designs of varying complexity, demonstrating substantial improvements in both efficiency and quality of reference model generation. ChatModel achieved a peak performance improvement of 58.99% compared to alternative methods, with notable enhancements in generation stability, and delivered a 9.18x increase in its capacity to produce reference model designs. Moreover, ChatModel accelerates the reference model design and validation cycles by an average of 7.11x over traditional manual approaches. These results highlight the potential of ChatModel to significantly advance the automation of reference model generation and validation.
Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning ICML2026
Multi-Agent Debate (MAD) has shown promise in improving reasoning and reducing hallucinations, yet it remains unclear how information exchange shapes individual reasoning behavior. Empirically, MAD exhibits paradoxical phenomena, including rising accuracy with increasing token entropy and marked differences between homogeneous and heterogeneous agent combinations. In this paper, we introduce a Bayesian uncertainty analysis framework for MAD, which decomposes answer-level predictive uncertainty into epistemic uncertainty and aleatoric uncertainty, corresponding to the potential gain and cost of debate. Across multiple agent configurations, we find that effective debate depends on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty-guided multi-agent reinforcement learning algorithm that encourages lower aleatoric cost and more effective epistemic information utilization. Experiments show that our approach simultaneously enhances each agent's accuracy and promotes a more productive debate process, providing an operational Bayesian perspective for understanding and improving MAD.
comment: ICML2026
Iterative Negotiation and Oversight: A Case Study in Decentralized Air Traffic Management
Achieving consensus among self-interested agents remains challenging in decentralized multi-agent systems, where agents often have conflicting preferences. Existing coordination methods enable agents to reach consensus without a centralized coordinator, but do not provide formal guarantees on system-level objectives such as efficiency or fairness. To address this limitation, we propose a regulated decentralized negotiation framework that augments a decentralized negotiation mechanism with limited regulatory oversight. The framework builds upon the trading auction for consensus, enabling self-interested agents with conflicting preferences to negotiate through asset trading while avoiding direct disclosure of private asset valuations. We introduce an oversight mechanism, which implements a taxation-like intervention that guides decentralized negotiation toward system-efficient and equitable outcomes while also regulating how fast the framework converges. We establish theoretical guarantees of finite-time termination and derive bounds linking system efficiency and convergence rate to the level of regulatory intervention. A case study based on the collaborative trajectory options program, a rerouting initiative in U.S. air traffic management, demonstrates that the framework can reliably achieve consensus among self-interested airspace sector managers, and reveals how the level of regulatory intervention regulates the relationship between system efficiency and convergence speed. Taken together, the theoretical and experimental results indicate that the proposed framework provides a mechanism for regulated decentralized coordination that preserves noncooperative final selection while safeguarding system-level objectives.
DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction
Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing works mainly focus on generating reports and summaries. In contrast, many enterprise tasks instead require an agent to identify concrete workflows which is a sequence of action-steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as: "How do I request new headcount given a fixed budget?". Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent to predict personalized workflow. We show that although DRFA improves over strong baseline agents (upto 10.02% average F1 score), there is substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.
TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law ICML 2026
Translating Hong Kong Court Judgments from English to Traditional Chinese is mandated by Articles 8-9 of the Basic Law, yet remains constrained by a shortage of parallel resources and rigorous demands on legal terminology, citation format, and judicial style. We introduce HKCFA Judgment 97-22, the first large-scale sentence-aligned parallel corpus for HK case law, comprising 344 professionally translated judgments (11,099 sentence pairs; 2.1M tokens) spanning 1997-2022. Building on this resource, we propose TransLaw, a multi-agent framework that decomposes translation into word-level expression, sentence-level translation, and multidimensional review, integrating a specialized Hong Kong legal glossary database, Retrieval-Augmented Generation, and iterative feedback, with four-dimensional expert review covering semantic alignment, terminology, citation, and style. Benchmarking 13 open-source and commercial LLMs, we demonstrate that TransLaw significantly outperforms single-agent baselines across all evaluated models, with convergence within 3 iterations. Human evaluation by 10 certified legal translators using our proposed Legal ACS metric confirms gains in legal-semantic accuracy, while showing that TransLaw still trails human experts in stylistic naturalness. The dataset and benchmark code are available at https://github.com/xuanxixi/TransLaw.
comment: Accepted at ICML 2026 - AI for Law
Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning
Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50 % compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl
comment: 12 pages (+4 supplementary). Website: https://rpg.ifi.uzh.ch/marl
Systems and Control (EESS)
A Lyapunov-Based Perspective on Absolute Stability
This article presents a unifying perspective on absolute stability concepts. In particular, it develops a Lyapunov-like explanatory framework for a nonscalar circle criterion with its small-gain and strict-passivity special cases. To this end, a general defining inequality for a Lyapunov-like function is proposed that avoids strict definiteness conditions, enabled by a strengthening of the sector constraint. We discuss different ways to derive a quadratic solution: via a linear matrix inequality (LMI), an algebraic Riccati equation, and a matrix equation. By exploiting the Kalman-Yakubovich-Popov (KYP) lemma, classical frequency-domain results are recovered. A passivity-index-based result is derived that simplifies the evaluation. Overall, the presented interrelations may be useful for both analysis and teaching.
comment: 16 pages, 7 figures; Preprint version of a manuscript accepted for at-Automatisierungtechnik (special issue 08/2026)
Integrated physics-based modeling reveals a thermodynamic gap in small modular reactor load following
Small modular reactors (SMRs) are increasingly considered for flexible power generation; however, many dynamic studies still neglect the thermodynamic coupling between the primary and secondary loops that is essential for accurate assessment of load-following capability. In this study, we develop a hybrid dynamic framework that couples an equation-based model of the NuScale integral pressurized water reactor, including the reactor, primary loop, and moving-boundary helical-coil once-through steam generator, with a physics-based secondary steam cycle comprising the valve, turbine, condenser, and feedwater pump. This approach enforces mass and energy conservation across the coupled system while preserving physically consistent flow interactions across the domain boundary. The integrated model reproduces nominal design-point conditions and is used to analyze a 5% step load rejection under five control strategies, including a decentralized three-loop control architecture for the valve, feedwater pump, and control rods. The results show that partial control strategies are insufficient for efficient and safe operation, whereas simultaneous action of all three actuators stabilizes steam pressure, limits adverse thermal excursions in the primary loop and maintains acceptable steam generator operating margins during load-following maneuvers. Compared with a conventional linear steam-cycle representation, the coupled framework captures dynamic back-pressure and variable turbine enthalpy drop that are otherwise neglected, leading to different predictions of transient behavior and required steam flow. These findings show that thermodynamically coupled, physics-based steam-cycle models are needed for more accurate assessment of the operational flexibility, efficiency and safety margins of SMRs under realistic load-following conditions.
A Mixed-Reality Testbed for Autonomous Vehicles
We propose a mixed-reality, hardware-in-the-loop (HIL) testbed for autonomous vehicles that seamlessly integrates a physical testbed of mobile robots with a high-fidelity simulation environment. The virtual simulation enables the creation of diverse, safety-critical driving scenarios to validate state-of-the-art perception, planning, and control algorithms, while augmenting simulations with physical robots equipped with multimodal sensors in photorealistic virtual environments further facilitating rigorous validation. Our testbed also features vehicular connectivity using wireless communication and can accommodate a large number of agents through the combination of physical robots and virtual simulated agents, supporting research on multi-agent systems including Connected and Autonomous Vehicles (CAVs). Finally, we present a safety-guaranteed framework combining perception, planning and a novel online learning-based controller using Control Barrier Functions (CBFs) for CAVs. Experiments using the proposed framework are used to validate and demonstrate the key functionalities and the overall utility of the testbed to bridge the gap between simulation and real-world hardware deployment.
comment: 9 pages, 7 figures, 1 table
Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation
Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight
Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.
comment: 6 pages 9 figues
RespGeomLib: A Reproducible Parametric Engine for Generating Analysis-Ready Human Airway Lumen Geometry
CT-derived airway models support pulmonary morphometry and airflow simulation, but are often limited by distal scan resolution and the need for substantial cleanup near bifurcations. Procedural alternatives are reproducible, yet many rely on stitched tubular primitives that introduce non-smooth junctions and poorly defined open boundaries. We present RespGeomLib, a reproducible parametric engine for generating analysis-ready human airway lumen surfaces from compact YAML specifications. The framework combines port-based assembly with implicit smooth-min junction blending to produce seamless junctions, while avoiding full-tree voxelization through analytic segments and local implicit extraction around bifurcations. Quantitatively, RespGeomLib yields cleaner junctions than a Boolean/stitch baseline and is substantially faster and more memory-efficient than whole-tree global implicit extraction. We further demonstrate morphometry-guided tree generation, controlled synthetic airway variants, and CFD-ready export with stable airflow simulation. RespGeomLib targets biomedical workflows requiring reproducible morphometry, controlled synthetic variants, and simulation-ready lumen geometry. The code is publicly available at https://nichula01.github.io/Respgeomlib/
comment: Accepted to Publication at 2026 IEEE Mercon
OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems
Dynamical systems are fundamental to modeling the natural world, yet modeling them involves a persistent trade-off: manually prescribed mechanistic models are interpretable by design but often overly simplistic and misspecified; in contrast, flexible data-driven neural methods lack physical insight. Hybrid modeling aims for the best of both worlds by combining a prescribed or symbolic, physics-based component with a flexible neural network. A critical challenge, however, is that the neural component may relearn mechanistic parts, yielding redundant and uninterpretable models, especially when the symbolic structure itself is discovered from data. Existing methods based on standard $L^2$ regularization rely on a projection argument that breaks when the symbolic component is learned through sparse discovery, allowing the neural augmentation to overlap with symbolic structure. We introduce \textbf{OrthoReg} (Orthogonal Regularization), which directly penalizes overlap between the symbolic and neural components, preventing symbolic structure from being absorbed by the neural residual. This yields a complementary decomposition: the symbolic part captures what the library can express, and the neural part captures what remains. On benchmark dynamical systems with partial library mismatch, OrthoReg improves symbolic recovery and out-of-distribution behavior.
Byzantine-Resilient Federated Multi-Agent Optimization Framework for Cyber-Secure Interconnected Microgrids
The escalating digitalization of distribution networks has exposed interconnected Microgrid (MG) clusters to Stealthy False Data Injection Attacks that bypass Bad Data Detectors and propagate through tie-line couplings and shared learning channels. This paper proposes BR-FedMAPPO, a Byzantine-Resilient Federated Multi-Agent Proximal Policy Optimization framework that learns a triple-surface Moving Target Defense and an adaptive isolation strategy for cyber-secure operation. Each MG hosts a local Actor-Critic Agent whose policy is partitioned into a globally federated shared encoder and a privately retained action head, so no MG exposes the configurations, cardinality, or locations of its D-FACTS lines, Battery Energy Storage (BES) units, or tie-line capacities. The action vector perturbs D-FACTS reactances, redirects BES injections, reshapes inter-MG exchanges, and includes a continuous islanding signal. A two-stage Byzantine-resilient aggregation rule combines trimmed-mean filtering with reward-weighted updates. This scheme incorporates a detection-quality score based on the F1-score and False Positive Rate to penalize clients causing false alarms. Simulation results on four interconnected MGs based on the IEEE 30- and 118-bus test systems demonstrate effective mitigation of coordinated S-FDI attacks, containment of cascading disruptions through adaptive isolation, and protection of distributed learning channels against malicious model manipulations while maintaining cost-aware dispatch performance.
Model-Free Reinforcement Learning Control for Resilient Cyber-Physical Systems
This paper compares the performance of model-free controllers on a nonlinear system under cyberattacks, including false data injection and denial-of-service attacks. Four RL reward types are analyzed for accuracy, cost, and resilience. Results show that the Lyapunov reward offers the best resilience with low tracking error. Exponential mode also provides good trade-offs with acceptable resilience under moderate training conditions. Progressive and linear rewards converge faster but are less robust. RL-MPCs show strong steady-state resilience but require longer training times; RL-PID controllers are faster with significantly less training time. Proximal Policy Optimization outperforms Deep Deterministic Policy Gradient with a significant reduction in KPI variance. This study serves to highlight how well-designed RL rewards can improve performance and resilience against cyber threats.
comment: Accepted to the 23rd IFAC World Congress 2026
FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs
Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset scale remains the dominant driver of performance, Mixture-of-Experts (MoEs) architectures have recently achieved state-of-the-art results by decoupling parameter count from computational cost. This efficiency enables training massive models on constrained compute budgets, yet it typically requires the high-speed interconnects of a single datacenter. To overcome these physical limits, recent approaches such as DiLoCo and Photon use low-communication data-parallel methods to enable scaling across geographically distributed, weakly connected data centers. However, these methods suffer from a fundamental inefficiency: they require full model replicas at every site, which imposes prohibitive memory constraints and communication overheads. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over DDP via partial expert replication in the studied regimes; (II) achieves empirical throughput speedups of up to 1.4x through a novel skip-token mechanism; and (III) shows stable routing in the trained proxy regimes and projects the communication/memory benefits to 100B-scale configurations through system modelling.
Riemannian Metric Preconditioning for Trajectory Tracking
We introduce a rank-one Riemannian cometric update inducing a modification of the Riemannian metric that makes specific directions of motion cheaper to travel along. We establish basic completeness properties of this reward metric, and give an explicit characterization of its Levi--Civita connection. We propose a preconditioned trajectory-tracking strategy by adding the connection-difference term to a standard intrinsic PD control, and illustrate the construction on a connection control-affine system on the Special Euclidean group with a maze navigation experiment. When the nominal trajectory is an integral curve of the vector field used to define the reward metric, our methodology improves the overall tracking, which is demonstrated through simulation results.
comment: 8 pages, 2 figures. The code can used to conduct simulations can be found at https://doi.org/10.5281/zenodo.20442932
A Generalized Sasaki Metric on the Second-Order Tangent Bundle
This paper constructs a connection map on the second-order tangent bundle induced by a linear connection on the base manifold and uses it to define a generalized Sasaki metric. The associated geodesic equations are derived, and jet-constrained variational problems are shown to yield Riemannian quintics in tension. The construction is then specialized to rigid body attitude dynamics with first-order actuator dynamics, producing an intrinsic higher-order trajectory model on the rotation group. Numerical simulations compare quintics in tension with Riemannian cubics as nominal trajectories and show modest reductions in actuator-relevant cost with comparable tracking performance.
comment: 6 pages, 2 figures. The code used to run simulations can be found at https://doi.org/10.5281/zenodo.20730224
A Finite-Gain Stability Approach to NMPC Design: the Extended Version
This paper proposes a novel approach to design of Nonlinear Model Predictive Control (NMPC) schemes based on Finite-Gain Stability (FGS) concepts. The proposed formulation considers the case where the plant is affected by unknown but bounded disturbances, which renders difficult the classical Lyapunov-based analysis/design. Based on FGS conditions for a closed-loop system, we develop a systematic NMPC design methodology, allowing us to choose the relevant NMPC parameters that lead to closed-loop FGS and provide a satisfactory tracking performance, also for the case of time-varying reference signals. A simulated example is presented to demonstrate the effectiveness of our framework, concerned with lateral/longitudinal control of an automated vehicle.
From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads
The rapid growth of large language model (LLM) inference is creating significant data-center loads that face increasing energy-management challenges under tightening grid conditions and demand response (DR) requirements. Conventional data-center energy management mainly relies on temporal and spatial workload shifting and campus-level energy asset scheduling, but it usually treats LLM inference demand as an aggregate load. As a result, these approaches fail to exploit the internal characteristics of LLM serving and therefore overlook the flexibility offered by LLM-specific techniques such as model quantization. To unlock this flexibility, this paper proposes a quantization-enabled energy management framework for grid-responsive LLM inference data centers. First, a quantization-to-power model is established to map each model--quantization configuration to a compact set of dispatchable parameters. Second, a two-stage quantization-enabled DR model is developed to account for model instance switching, request routing, and precision selection. Third, a multi-campus co-optimization method is introduced for DR participation by integrating grid-side electricity and carbon signals with the quantization-enabled DR model. Case studies show that the proposed framework reduces total data-center operating cost by 34.3\% without curtailing served token volume, validating model quantization as an effective flexibility lever for grid-responsive LLM data-center energy management.
comment: 10 pages, 7 figures
Bridging Data-Driven and Model-Based Methods: A Learn-to-Optimize Architecture for Distributed Optimal Power Flow
This letter proposes a learn-to-optimize (LTO) architecture for distributed optimal power flow (D-OPF) as the nexus between data-driven and model-based methods. By unfolding alternating direction method of multipliers (ADMM) into a deep neural network (NN) and embedding differentiable optimization layers, our architecture realizes near-instantaneous interpretable distributed decision-making. For mainstream relaxed formulations of D-OPF, the decisions from our architecture achieve comparable optimality with that of state-of-the-art solvers and excelled feasibility compared with existing data-driven approaches. Comparative case studies underpin the effectiveness of our architecture regarding the optimality and feasibility.
comment: This work has been submitted to the IEEE for possible publication on May 2026
A Theory-Guided Advanced Regulatory Control Synthesis for Cooling-Limited Exothermic Semi-Batch Reactors
This paper studies theory-guided advanced regulatory control (ARC) synthesis for cooling-limited exothermic semi-batch reactors, whose productivity and thermal safety are governed by changing active constraints. Industrial ARC uses feedback loops, cascades, selectors, feedforward/override logic, and valve-position elements, but signal selection, pairing, interconnection, and tuning remain heuristic. Nonlinear model predictive control (NMPC) gives a systematic constrained-operation workflow, but requires a maintained nonlinear model, state estimator, and online optimizer. We combine finite-horizon minimum-time optimality with local safety analysis to develop a systematic analysis-to-architecture ARC synthesis workflow for cooling-limited semi-batch reactors. Under stated assumptions, the workflow translates boundary-seeking optimality into a cooling-demand valve-position-control (VPC) architecture and translates local safety requirements into near-boundary tuning rules. On a reduced benchmark and an industrial-scale polymerization, ARC is nominally competitive with an implemented nominal-model output-feedback nonlinear model predictive control (OF-NMPC) benchmark using extended Kalman filter (EKF) state estimation. In the studied adverse parameter mismatch and unmodeled fault scenarios, ARC keeps temperature-limit violation at 0%, whereas OF-NMPC either violates the limit or fails to complete the batch.
PowerAgentBench-SS: A Benchmark for Agentic AI in Power System Steady-State Studies
Power system benchmarks usually evaluate numerical solvers, prediction models, or sequential controllers. These benchmarks are necessary, but they do not directly test whether a Large Language Model (LLM) agent can execute an engineering workflow: inspect a grid case, select tools, call simulators, screen contingencies, propose admissible mitigations, validate results, and produce an auditable evidence trail. This paper introduces PowerAgentBench-SS, a steady-state benchmark framework for evaluating tool-using agents in power system operation and planning studies. The benchmark exposes public case data, action constraints, a tool API, and a validation budget to an agent, while a hidden evaluator recomputes physical validity and scores the submitted report. We define the agent interface, tool contract, evidence log, and risk-sensitive metrics, including submitted recall, evidence-backed recall, found recall, false-safe penalties, severity regret, residual violation score, action cost, tool-use efficiency, and workflow diagnostics. To make the framework concrete, we instantiate the protocol in a reproducible DC thermal N-2 contingency-search pilot on deterministic IEEE 39-bus operating-point variants, with scripted baselines, an LLM JSON-command adapter, three locally hosted Ollama LLM agents, and one OpenAI API agent. The results show why solver-only or answer-only evaluation is insufficient: agents are distinguished not only by top-contingency discovery, but also by validation-budget use, explicit submission, type coercions, duplicate validations, evidence-backed reporting, and mitigation behavior.
LQR based stabilization of an 1D heat equation with advection and memory effects
We derive a one-dimensional model for heat transfer in a moving fluid incorporating Fourier conduction, an exponentially decaying memory term, and advection under thermally insulated boundary conditions. We numerically construct a bounded state feedback law driving the closed-loop solution to zero exponentially with decay rate at least $ω>0$ for every initial state, i.e., we solve the $ω$-stabilization problem. We explicitly describe the eigenvalues of the state operator $A$, a subset of which converges to a finite negative accumulation point that sets the upper bound on the achievable decay rate. Since $A$ lacks compact resolvent, we show that the spectrum is the closure of its eigenvalues, each of finite algebraic multiplicity, and use this to verify stabilizability. For $ω$ below the accumulation bound, the problem is solvable provided the control operator $B$ satisfies a non-orthogonality condition. To compute gains, we formulate an LQR problem and solve finite-dimensional approximations: for each $n$ we construct $A_n$, $B_n$ approximating $A$, $B$ and solve the associated algebraic Riccati equation for a gain $K_n$. We show that, for all sufficiently large $n$, $K_n$ can be chosen so every eigenvalue of $A_n+B_nK_n$ satisfies $\operatorname{Re}λ<-ω$, and we establish stabilizability of $(A_n+ωI,B_n)$ uniformly in $n$. Hence, for large $n$, these gains solve the $ω$-stabilization problem for the original system. We validate the results numerically with an example.
Wind-Resilient Trajectory Optimization for UAV-BS Networks: TD3 for Continuous Service Availability
Unmanned aerial vehicle (UAV)-mounted base stations are highly susceptible to wind disturbances such as gusts and turbulence, which induce positional drift and degrade communication link quality, particularly in emergency scenarios. To address this challenge, we propose a DRL-based framework for wind-resilient trajectory adjustment and positioning based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. The method models wind as a stochastic kinematic perturbation, avoiding complex aerodynamic modeling, thereby enabling the TD3 agent to learn adaptive control policies that maintain optimal coverage footprints. By prioritizing user-centric performance metrics under turbulent conditions, the proposed architecture ensures continuous service availability despite external disruptions. Simulation results demonstrate that the TD3-based approach effectively compensates for wind-induced displacements and outperforms benchmark methods, including Proximal Policy Optimization (PPO), in terms of throughput stability and robustness in windy environments.
AI4SE and SE4AI Exploration: A Decade Looking Back and Forward
The March 2020 INCOSE INSIGHT special issue on AI and Systems Engineering (SE) became the most downloaded issue in the publication's history and launched a research community that now draws over 250 registrants to its annual workshop. In this article, we trace the progress in AI and SE across three phases (labeled here foundational, applied, and LLM inflection) based on the authors' reading of the field's core papers, and describe our opinions of where the community has converged and where critical gaps remain. Separately, a human-AI agreement literature review leveraging both human expertise and six AI models was performed to assess the relevance of 1,712 INCOSE INSIGHT articles and 889 SERC publications. The results identify five critical research gaps and offer guidance for practitioners navigating AI adoption, assurance, and workforce transformation in SE. We share the agreement data and the AI4SE/SE4AI Explorer web application so readers can compare their own relevance judgments with the human and AI raters.
comment: 10 pages, 5 figure
Ramping Procurement and Bid-Cost Recovery in Real-Time Market
We study ramping procurement co-optimized with economic dispatch under net-demand uncertainty. We examine two flexible ramp product designs implemented by grid operators: single-interval and multi-interval co-optimization. Both rely on rolling-window stochastic optimization with binding and advisory interval decisions. We develop analytical frameworks to evaluate generator profits, consumer payments, bid cost recovery (BCR), and operational efficiency. In particular, net-demand uncertainty may lead to generator under-compensation, requiring discriminatory BCR. While operational efficiency is invariant to energy and ramp prices, producer profits and consumer payments depend critically on pricing. We examine locational marginal pricing (LMP) and two uniform pricing: maximum dispatch cost pricing (MDCP) and maximum temporal locational marginal pricing (MTLMP). With out-of-market BCR, LMP yields discriminatory energy prices, whereas MDCP eliminates BCR and MTLMP does so in most cases. This property enables us to establish truthful bidding incentives for price-taking generators under MDCP. Our analysis highlights trade-offs between single- and multi-interval co-optimization and pricing designs: single-interval energy-ramp co-optimization is advantageous under high forecast uncertainty and moderate ramping requirements, whereas multi-interval co-optimization is superior when net-demand forecasts are relatively accurate and ramp needs are challenging. Empirical results on CAISO and ERCOT data show that MDCP and MTLMP increase producer profits with negligible BCR, albeit at the expense of higher consumer payments relative to LMP.
comment: 4 figures
Safe, Real-Time Active Model Discrimination and Fault Diagnosis for Nonlinear Systems via Differentiable Reachability
We present a safe, real-time algorithm for active fault diagnosis and model discrimination for uncertain continuous-time nonlinear systems with process and measurement disturbances. Given a finite set of candidate models representing nominal and faulty modes, including actuator and sensor faults, we formulate an output-feedback, time-varying policy optimization problem that (i) robustly enforces state-input safety constraints over a finite horizon and (ii) drives the system to produce sampled measurements consistent with at most one model, enabling deterministic diagnosis. To solve this problem in real time, we develop a tractable approximation using interval over-approximations of reachable state and output sets, and encode diagnosability via a differentiable objective that penalizes overlap between the reachable output sets of possible models. The resulting optimization is solved efficiently online with gradient-based methods using JAX and differentiable reachability primitives. We evaluate our method on sensor and actuator fault diagnosis (up to 11 fault modes) in several high-dimensional nonlinear robotic systems, including a simulated quadrotor and fighter-jet model, a hardware differential-drive robot, and quadrupedal navigation. Across these case studies, our approach achieves reliable model discrimination in under 50 ms, outperforming baselines in discrimination success rate and speed while providing formal safety guarantees.
GDGU: A Gradient Difference-based Graph Unlearning Method for Cyberattack Localization in Electric Vehicle Charging Networks
Electric vehicle charging stations (EVCSs) can expose distribution feeders to cyberattacks. While machine learning methods, including graph neural networks, can localize which bus is compromised, significant challenges remain in data sharing and model training. For example, privacy regulations grant EVCS owners the right to delete their training data from a deployed model, yet retraining from scratch on every request is computationally prohibitive. To address this, we study graph unlearning (GU) for EVCS cyberattack localization, formulated as a feature-level unlearning problem on a graph-level multi-label classification task. Specifically, we propose gradient difference-based graph unlearning (GDGU), which removes the influence of the requested deletion data through a first-order parameter correction. The correction is computed from the gradient difference between the original training data and a modified dataset in which only the charging power features at the requested EVCS buses are unlearned. Then, a batch-normalization recalibration and a brief recovery fine-tuning step are applied to restore localization utility. We benchmark GDGU against two second-order GU baselines on the IEEE 34-bus, 123-bus, and 8500-node distribution networks across three graph neural network backbones and cumulative unlearning scenarios. GDGU matches the strongest baseline on localization utility and reaches forgetting fidelity close to full-retraining, while unlearning 10 to 12 times faster than retraining from scratch and using far less memory than the second-order GU baselines.
pdSTL: Probabilistic Differentiable Signal Temporal Logic for Stochastic Systems
Autonomous robots operating in uncertain environments must satisfy complex temporal and safety specifications despite stochastic dynamics and sensing noise. While Signal Temporal Logic (STL) offers robustness measures for gradient-based optimization, existing extensions either lack differentiability or ignore belief-space uncertainty. We introduce pdSTL (probabilistic differentiable Signal Temporal Logic), a framework that unifies probabilistic semantics with differentiable robustness over belief trajectories. pdSTL employs interval-valued probabilistic semantics to compute conservative satisfaction bounds, propagated compositionally through the STL syntax tree. We formulate the temporal robustness evaluation as a recurrent, LSTM-style unfolding of STL operators, enabling linear-time, differentiable monitoring suitable for end-to-end trajectory optimization. We validate pdSTL on simulated obstacle avoidance, lane-change maneuvers, and real-world Crazyflie quadcopter flight experiments under aerodynamic disturbances. Results demonstrate that pdSTL achieves efficient optimization with formal probabilistic guarantees, significantly outperforming deterministic differentiable STL in maintaining safety margins under real-world uncertainty.
ev-flow: A Reproducible, NHTS-Grounded Generator of Synthetic Plug-in Electric Vehicle Charging Behavior for Eight U.S. Regions
Electric-vehicle grid-integration studies need large, behaviorally realistic populations of individual charging profiles, but real charging telemetry is scarce and privacy-restricted, and the existing open generators are calibrated to non-U.S. mobility surveys or flatten the regional, seasonal, and equipment heterogeneity that drives aggregate demand. We present \texttt{ev-flow} (import name \texttt{pev\_synth}), an open-source, MIT-licensed Python package that generates synthetic plug-in electric vehicle charging behavior for eight U.S. regions, grounded in 2017 National Household Travel Survey (NHTS) microdata and regional sales-mix models. A deterministic nine-stage pipeline (M1--M9) carries each vehicle from survey records to a time-stamped charging profile: it stitches survey person-days into donor-matched 365-day travel calendars with a temperature-dependent winter energy uplift, samples behavioral plug-in start times from the published SPEECh K=16 Gaussian-mixture parameterization, evaluates a three-layer Bernoulli plug-in model, propagates a continuous-time state-of-charge ledger with an explicit PHEV gasoline range-extension term, and rasterizes plug status to 15-minute and hourly grids. The package generates residential and workplace profile types with descriptive EVSE brand and connector enrichment; every output is UTC-stored, timezone-aware, and bit-reproducible from a single master seed. A validation runner compares the generated distributions against published bounds and classifies every divergence with literature provenance: the reference \texttt{bay\_area} residential profile rolls up to 11 PASS, 0 unexplained FAIL, 6 explained failures, and 4 explained skips across 21 applicable checks. \texttt{ev-flow} fills a U.S.-focused, NHTS-grounded niche complementary to European generators such as emobpy and VencoPy and to charging simulators such as datafev and ACN-Sim.
comment: 20 pages
Proprioceptive Invariant State Estimation for Humanoid Robots on Non-Inertial Ground
This paper presents an invariant extended Kalman filtering (InEKF) approach for real-time state estimation of humanoid robots operating on non-inertial ground using only onboard proprioceptive sensing. The proposed approach estimates the robot's base position and velocity relative to the moving ground frame without requiring direct measurements of ground motion or externally mounted sensors. By exploiting kinematic constraints at the stance foot through foot-mounted IMUs, the filter accounts for ground-induced nonlinearities in the process and measurement models while remaining fully proprioceptive. The estimator is formulated to admit a right-invariant measurement model, enabling favorable error dynamics under large initial uncertainties. Observability analysis establishes conditions under which the robot's relative base position and velocity are observable with respect to the non-inertial ground frame. Experiments with the Digit humanoid robot standing and squatting atop a swaying and pitching ground showcase a 96% speedup in convergence rate and an 80% reduction in position estimate errors over existing InEKFs. Walking experiments on a uni-axially rotating ground achieve an average estimation error of less than 9 cm for an initial error of up to 1 m.
Simulating Robotic Locomotion in Sand: Resistive Force Theory in an Open-Source Physics Engine
Recent advancements in Resistive Force Theory (RFT) enable approximation of ground reaction forces for locomotion in sand without the computational expense of modeling interactions with individual grains. However, these tools have been absent in 3D physics engines commonly used for robot simulation. We explore if resistive force approximations are sufficient, when integrated with standard dynamics calculations, to provide a stable substrate for a freely walking robot. To determine this, we implement 3D Granular Resistive Force Theory (3D RFT) in a physics simulation engine, MuJoCo. We verify simulations in multiple scenarios to demonstrate that key trends due to end effector shape, speed, and loading are preserved. Our implementation predicts walking distance and foot sinkage of a 12-Degree of Freedom hexapod robot within 20\% of experiments in sand. While RFT has inherent approximations, the open source tool described here has potential to help develop new and improved robot designs to traverse granular media substrates.
comment: 12 pages, 7 figures
Safe Output-Feedback Adaptive Optimal Control of Input-Constrained Control-Affine Nonlinear Systems
In this paper, a novel online, safe output-feedback, critic-only, adaptive optimal control framework is developed for safety-critical control of partially observable systems. The developed framework ensures system stability and safety, regardless of the lack of full-state measurements, while learning and implementing a near-optimal controller. The approach leverages linear matrix inequality-based observer design methods to efficiently search for observer gains for effective state estimation. Then, approximate dynamic programming is used to develop an approximate controller that uses simulated experiences to guarantee the safety and stability of the closed-loop system. Safety is enforced by adding a recentered robust Lyapunov-like barrier function to the cost function that effectively enforces safety constraints, even in the presence of state estimation errors. Lyapunov-based stability analysis is used to guarantee uniform ultimate boundedness of the trajectories of the closed-loop system and ensure safety. Simulation studies are performed to demonstrate the effectiveness of the developed method through two real-world safety-critical scenarios, specifically one ensuring that the state trajectories of a given system remain within a given set, and the other ensuring that the system avoids an obstacle.
Koopman--Nemytskii Operator of Nonlinear Controlled Systems and Its Learning for Controller Synthesis
While the Koopman operator represents a nonlinear system as a linear operator in a function space, its definition does not involve inputs. For controller synthesis, an operator model is needed to describe the effect of feedback laws on closed-loop systems, so that the desired state-feedback law can be computationally searched based on such a predictive model. To this end, this paper proposes a Koopman--Nemytskii operator, defined as a linear operator that maps canonical features of state--policy pairs in a reproducing kernel Hilbert space (RKHS) to that of succeeding states. Under regularity conditions on the dynamics and kernel selection, this operator is definable on suitable Sobolev-type RKHSs, and its data-based estimation guarantees bounded errors in single-step prediction, multi-step prediction, and accumulated cost under control. The controller synthesis problem is thus formulated as a convex kernel-based optimization one and efficiently solved in a sample-based manner.
comment: 32 pages, 6 figures, submitted to IEEE Transactions on Automatic Control after revision on 2025-10-02 and 2025-06-17
Boundary adaptive observer design for semilinear hyperbolic rolling contact ODE-PDE systems with uncertain friction
This paper presents an adaptive observer design for semilinear hyperbolic rolling contact ODE-PDE systems with uncertain friction characteristics parameterized by a matrix of unknown coefficients appearing in the nonlinear (and possibly non-smooth) PDE source terms. Under appropriate assumptions of forward completeness and boundary sensing, an adaptive observer is synthesized to simultaneously estimate the lumped and distributed states, as well as the uncertain friction parameters, using only boundary measurements. The observer combines a finite-dimensional parameter estimator with an infinite-dimensional description of the state error dynamics, and achieves exponential convergence under persistent excitation. The effectiveness of the proposed design is demonstrated in simulation by considering a relevant example borrowed from road vehicle dynamics.
comment: 12 pages, 5 figures. Under review at Automatica, 3rd review round
Translation-Symmetric Market: Enabling Incentive Compatibility For DER Aggregation
Virtual power plants (VPPs) are important for coordinating the rapidly growing portfolios of distributed energy resources (DERs) and enabling them to deliver multiple services to higher-level electricity markets. However, profit allocation procedures for VPP participants become increasingly difficult to design in an incentive-compatible manner, owing to the increased market power of DERs within each VPP relative to their direct participation in wholesale markets. In this paper, we introduce translation symmetry in electricity markets and apply it to VPP aggregation of DERs for market participation to design an incentive-compatible profit allocation method. Under the stated assumptions, we prove that this translation symmetry induces an inductive property: once incentive compatibility holds at an upper level, it propagates to the internal settlements between the VPP and its constituent DERs, thereby supporting incentive compatibility throughout the hierarchy. We further show that service prices are invariant across levels, which helps preserve competitive conditions and enables transparent value assessment. Theoretical analysis and case studies illustrate how this translation-symmetry-based approach can enable incentive-compatible profit allocation when aggregating DERs to provide multiple services.
comment: Submitted to IEEE PES ISGT Asia 2026
Information-Theoretic Meta Dynamic Programming for Signalling and Control of POMDPs
In this paper, we study the information-theoretic characterization of simultaneous signalling and control over channels modeled by partially observable Markov decision processes (POMDPs). The problem is formulated as an optimization over randomized control strategies that maximize the directed information from actions to observations, subject to an average-cost constraint. We derive a novel dynamic programming framework in which the state is defined on the space of conditional probability distributions, leading to a high-level ``meta'' dynamic program. Specifically, we show that two coupled information states, namely, the posterior distribution of the system state and a distribution over such posteriors, satisfy Markov recursions and provide sufficient statistics for optimal control. This structure enables the decomposition of optimal strategies into separated randomized policies that depend only on these information states. Our results establish necessary and sufficient conditions for optimality and unify classical stochastic control and information-theoretic formulations. In particular, we show that in the absence of signalling, the proposed framework reduces to the standard dynamic programming equations for POMDPs. The developed approach provides a principled foundation for analyzing and designing control systems with intrinsic information constraints.
comment: 8 pages, 1 Figure
Experimental Characterization Data for Battery Modules with Parallel-Connected Cells across Diverse Module-Level State of Health and Cell-to-Cell Variations
This experimental dataset presents both module-level and cell-level characterization data for lithium-ion battery modules composed of three parallel-connected inhomogeneous cells across a wide range of module-level state of health (M-SoH) and cell-to-cell variation (CtCV). First, 70 cells are aged to establish an inventory with cell-level state of health (C-SoH) ranging approximately from 100% to 80% (80% is considered as the end-of-life for automotive applications). From this inventory, 78 battery modules are then assembled, each exhibiting a distinct M-SoH value (from 100% to 80.98%) and a unique CtCV value (from 0% to 9.31%, defined as population standard deviation of C-SoH within each module). Module-level characterization data are collected under 25°C at 0.5C and 0.25C conditions, enabling extraction of module-level capacities and supporting diagnostic analyses such as incremental capacity analysis and differential voltage analysis. Before a module is assembled and tested, cell-level characterization tests are conducted for every individual cell within that module under 1C conditions, enabling direct quantification of CtCV and providing accurate labels for cell-level capacities and internal resistances. The dataset is organized with both raw time-series data and processed summary information such as C-SoH, M-SoH, and CtCV for all modules. With the paired module-level and cell-level characterization data, this dataset enables understanding and development of advanced degradation monitoring mechanisms for battery modules with parallel-connected cells in the presence of CtCVs.
comment: Addressed reviewer comments
OmniDroneX: An LLM-Assisted Holistic Drone-as-a-Service Ecosystem
Despite rapid advances in UAV technologies, current deployments remain limited due to several gaps in UAV systems research. To address these challenges, we propose OmniDroneX, a unified Drone-as-a-Service ecosystem, in which drones are transitioned from fixed function platforms into dynamically composable entities that can be integrated with external infrastructures to offer omni-capabilities. OmniDroneX bridges low-level physical primitives with high-level mission intent through a unified vendor-agnostic interface (libUAV) and a formal physical-service abstraction model (PT-SOA). A core innovation is the diverse application of large language models (LLMs) across multiple layers of the OmniDroneX architecture. LLMs are used to assist in identifying and formalizing primitive device functions and abstract service definitions, supporting automated service composition and workflow generation, and enabling interactive, natural-language mission specification and refinement. OmniDroneX also incorporates important categories of composition techniques that are essential in dynamic UAV systems, including physical layer composition for drone capability augmentation, as well as spatiotemporal, functional, collaborative, exception-aware, and QoS-based service compositions. Collectively, these features allow OmniDroneX to serve as a foundation for scalable, resilient, and self-evolving UAV ecosystems operating in complex and dynamic environments.
comment: This manuscript is a full version of a paper accepted in shortened form by IEEE International Conference on Joint Cloud Computing
Prescribed-Time Distributed Generalized Nash Equilibrium Seeking
Safety-critical multi-agent systems, from cooperative guidance to collision avoidance, must often reach a coordinated decision by a hard deadline rather than merely converge to one eventually. This paper proposes the first fully distributed algorithm that solves the generalized Nash equilibrium (GNE) problem, a non-cooperative game with shared coupling constraints and general cost coupling, at a user-prescribed time $T$ independent of initial conditions. The foundation is a centralized, prescribed-time result built on the optimization Lyapunov function framework and implemented via unnormalized Hessian-gradient feedback, chosen because, unlike the Newton and normalized Hessian-gradient realizations, it naturally splits into per-agent computations. Distributing this feedback requires each agent to run three coupled processes simultaneously: a prescribed-time observer of the global state, a local optimization law, and a dual-consensus mechanism that enforces the shared multipliers of the variational GNE. Their simultaneous operation is the core difficulty, as the optimization continually displaces the states the observers track, while estimation errors corrupt the gradients that drive the optimization. We resolve this coupling with a multi-rate gain schedule whose observer and dual-consensus layers contract strictly faster than the optimization layer, so that every error component vanishes exactly at $T$. A Fischer-Burmeister reformulation keeps the design projection-free while enforcing the constraints at the deadline. Numerical results for a Cournot game and a time-critical sensor-coverage problem validate the approach and demonstrate its use as a solver-in-the-loop for time-critical autonomy.
comment: 12 pages, 5 figures
On Feedback Speed Control for a Planar Tracking
This paper investigates a planar tracking problem between a leader and follower agent. We propose a novel feedback speed control law, paired with a constant bearing steering strategy, to maintain an abreast formation between the two agents. We prove that the proposed control yields asymptotic stability of the closed-loop system when the steering of the leader is known. For the case when the leader's steering is unavailable to the follower, we show that the system is still input-to-state stable with respect to the leader's steering viewed as an input. Furthermore, we demonstrate that if the leader's steering is periodic, the follower will asymptotically converge to a periodic orbit with the same period. We validate these results through numerical simulations and experimental implementations on mobile robots. Finally, we demonstrate the scalability of the proposed approach by extending the two-agent control law to an N-agent chain network, illustrating its implications for directional information propagation in biological and engineered flocks.
Periodic robust robotic rock chop via virtual model control
Robotic cutting is a challenging, contact-rich manipulation task where the robot must simultaneously negotiate unknown object mechanics, large contact forces, and precise motion requirements. Our hypothesis is that this complexity can be alleviated through the design of a physically structured virtual-model controller that uses switched virtual mechanisms to generate a robust, rhythmic rock-chop motion for robotic cutting, without requiring pre-planned trajectories or precise environmental information. Motion is generated by the interaction between the environment, the robot's dynamics, and the virtual forces of the switching virtual mechanism, ultimately realized through the available actuation. Through theoretical analysis and experimental validation, we demonstrate that the controlled robot behavior settles into a stable periodic motion. Experiments with a Franka manipulator demonstrate robust cuts across five different vegetables, achieving sub-millimeter slice accuracy for thicknesses from 1 mm to 6 mm at a rate of nearly one cut per second. The controller maintains high performance despite changes in knife shape or cutting board height, and successfully adapts to a different humanoid manipulator, demonstrating robustness and platform independence.
Robotics
Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement
Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism of practicing and learning from feedback. In this paper, we propose VERITAS, a generator-verifier framework for generalist robot policies for inference-time policy steering and self-improvement. We use a pre-trained generalist robot policy as a ``generator'' and pair it with a gradient-free ``visual verifier'' that evaluates actions at inference time. This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data. Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves comparable efficiency to expert demonstrations, while requiring no human interventions. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.
comment: Website: https://veritas-improvement.github.io
MOCHI: Motion Enhancement of Collaborative Human-object Interactions SIGGRAPH 2026
Collaborative human-object interaction shows dynamic and complex movements that require mutual anticipation and continuous adjustment between participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios requires high-quality data acquisition as a foundational step; however, this is challenging due to the inherent complexity of MHOI where human-human and human-object interactions occur simultaneously. Such complexity leads to noisy MHOI captures characterized by several artifacts: contact misalignment between hands and objects, motion jitter and temporal inconsistencies in the captured sequences, and missing or incomplete finger-level articulation details. To address these challenges, we present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. Our approach first generates physically plausible hand grasps through optimization from noisy body input, producing grasps that are both physically plausible and semantically consistent with the body pose, where these optimized grasps are extended into complete hand-object interaction sequences. Consequently, the full-body motion for all participants are refined through a diffusion-based noise optimization framework that uses single-person motion priors. During the optimization process, we introduce optimization objectives to encode human-object and human-human interaction information within these single-person priors. Experimental results demonstrate the effectiveness of our pipeline across diverse MHOI data, either acquired by existing capture methods or synthesized by generative models. We further show robustness of our system across varying numbers of participants and types of interactions, and demonstrate various applications including keyframe-based MHOI creation and data augmentation through varying object geometries.
comment: SIGGRAPH 2026 Journal (ACM TOG); Project page: https://jiyewise.github.io/projects/MOCHI/
EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies
We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with near success rates exhibit strikingly different capability profiles: $π_{0.5}$ achieves the highest test success rate and the best train--test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a disjoint set of atomic skills compared to other policies. Beyond capability profiling, EBench analyzes the generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal strengths and weaknesses of models behind an overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
Adaptive Volumetric Mechanical Property Fields Invariant to Resolution ICML 2026
Accurate mechanical properties (or materials) Young's modulus ($E$), Poisson's ratio ($ν$) and density ($ρ$) are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate dense spatially-varying ($E$, $ν$, $ρ$) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state-of-the-art. The foundation of our technique is a sparse and adaptive voxel structure SAV that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with lesser test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.
comment: Project Page and hi-res paper: https://research.nvidia.com/labs/sil/projects/adavomp/. ICML 2026
Beyond Failure Recovery: An Engagement-Aware Human-in-the-loop Framework for Robotic Systems
Conventional human-in-the-loop approaches typically involve users only when a robot encounters failure or uncertainty, treating humans primarily as tools for improving robot performance. However, in many human-centered robotics settings, interaction should support engagement by keeping users involved in decision-making rather than limiting them to failure-driven interventions. This is particularly compelling in physical caregiving, where mobility limitations can reduce users' ability to intervene or modulate the robot's behavior in the moment. As a result, failure-driven interaction policies may relegate users to passive observers for long stretches of the task. For example, a user with mobility limitations may feel less engaged when being continuously and passively fed by a robot. At the same time, overly frequent interaction can be tiring and increase the user's workload. To address this trade-off, we propose Engagement-aware MPC (E-MPC), a user-engagement-aware method that plans interaction to maintain engagement while respecting a workload constraint. E-MPC leverages a user interaction dynamics model that captures how user engagement evolves as a function of both the frequency and type of interaction. Rather than requesting input only when difficulties arise during task execution, the robot proactively considers the user's preferred level of engagement throughout the task, balancing autonomy and interaction while ensuring task success. We evaluate E-MPC in simulation with several ablations and baseline comparisons. Results demonstrate the effectiveness of our approach across diverse user personas. In addition, we conduct a real-world user study with participants with emulated mobility limitations on a robot-assisted bite acquisition system, showing that E-MPC improves user experience while maintaining task success.
comment: Project website at https://emprise.cs.cornell.edu/empc
Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So
A robot's flash endurance is a non-renewable stock: every persisted write spends one of a few thousand program/erase cycles and never refills, yet no fielded robot memory system prices which memories are worth an erase cycle. We treat embodied memory as depreciating capital and price that stock with a single endurance shadow price $η$, which makes cost-minimizing placement across a RAM / on-board NVM / cloud hierarchy a threshold in a wear-augmented per-byte index. The index is cost-optimal whatever the sign of the value-write association $χ$; only when $χ> 0$ does the optimum turn non-monotone, sending a robot's most valuable memories off its flash. The pivot is thus empirical, and we measure $χ$ on real robot logs at a pre-specified gate: its sign is a property of the deployment regime -- positive on recurrent long-horizon manipulation ($\hatχ \approx +1.0 \times 10^{-3}$, replicated at full power), null on a shorter-horizon suite, and negative on non-recurrent teleoperation. Two boundaries scope the result. The endurance budget is dormant on premium 3,000-P/E TLC at datasheet prices and binding on the commodity QLC/eMMC ($\sim$1,000 P/E) that cheaper edge robots run. And where it binds, a learned wear-aware controller only ties price-based routing on task value, because realized value is tier-invariant across RAM, NVM, and cloud: the rent governs device lifetime and cost, not task performance. Whether wear-aware placement improves task value remains open -- $χ$ is measured against a value proxy, and the non-monotone optimum, while proven, is not yet observed in data.
Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System
Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses it through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.
WireCraft: A Simulation Benchmark for Industrial DLO Manipulation
Deformable Linear Objects (DLOs), such as wires and cables, are central to industrial assembly. Unlike rigid objects, whose state is captured by a 6-DoF pose, DLOs have an infinite-dimensional configuration space and deform continuously under contact with grippers, fixtures, and the workspace, making them a demanding benchmark for general dexterous manipulation. Despite their importance, policy development and comparison remain difficult: existing benchmarks are often tied to specific hardware setups, lack modular and customizable task assets, or study generic deformable-object tasks without the fixtures relevant to real-world industrial wire manipulation. Few benchmarks align simulation, real-world data, and shared evaluation protocols. To bridge this gap, we introduce WireCraft, a simulation benchmark for industrial DLO manipulation with configurable difficulty and assets, spanning three task families: connector insertion, clip routing, and channel seating. It supports two complementary DLO physics models, articulated and deformable, and the trajectories come from both simulation and a physical UR5. We benchmark reinforcement learning (RL), imitation learning (IL), and vision-language-action (VLA) policies under shared metrics. Privileged state-based RL solves a representative setting in each task family with over 82\% success, confirming the tasks are well-posed. For connector insertion, however, the transition from reaching the socket to contact-rich alignment remains a key bottleneck for vision RL, IL, and VLA policies. These results indicate that industrial DLO manipulation, though tractable under privileged state, remains an open challenge for current vision-based learning. The benchmark, data, and tools will be open-sourced upon acceptance.
EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning
Cross-end-effector grasp generation seeks a unified model that generalizes across objects and across embodiments ranging from parallel grippers to dexterous end effectors. Existing grasp generators are typically designed for a fixed embodiment or encode embodiment identity with a static descriptor, which weakens transfer when topology, actuation coupling, and contact geometry differ substantially. We present EAGG, an embodiment-aligned grasp generator that represents each embodiment with a topology-aware end-effector graph and an embodiment-specific low-dimensional end-effector control space. A frozen end-effector-cognition backbone converts the current articulated state into geometry-aware tokens that act as a reusable morphology prior, and iterative geometry injection refreshes these tokens throughout sampling so that conditioning remains synchronized with the evolving end-effector geometry. On the MultiGripperGrasp benchmark, EAGG reaches 56.17% average success across six training end effectors, remaining within 1.10 percentage points of specialized training while preserving transfer to finetuning and zero-shot end effectors. Iterative geometry injection further reduces the pooled median contact distance from 0.239 cm to 0.189 cm. These results show that cross-end-effector grasp generation is strengthened by aligning embodiment structure inside a shared generator rather than suppressing embodiment differences. Code is available at https://github.com/wanhaoniu/EAGG.
comment: 16 pages, 8 figures. Code is available at https://github.com/wanhaoniu/EAGG
A Hybrid Optimization Framework for Grasp Synthesis under Partial Observations
We propose a hybrid grasp synthesis framework that combines a learning-based Energy-Based Model (EBM) with an analytical Iterative Closest Point (ICP) method to generate robust grasps from partially observed point clouds. The learned energy function acts as a prior within a Stein Variational Gradient Descent (SVGD) framework, guiding iterative refinement of grasp configurations. Evaluated on 67 objects with 5,360 grasp attempts, our method achieves an average success rate of 60.9\%, outperforming AnyGrasp (31.1\%) and Grasp Pose Detection (48.4\%) and AS-ICP (56.6\%). These results highlight the strong generalization ability of our approach and demonstrate how combining data-driven learning with geometric optimization addresses the limitations of either strategy in isolation.
Uncertainty Quantification for Flow-Based Vision-Language-Action Models
Vision-language-action models (VLAs) combine vision-language backbones with expressive generative action heads trained via flow matching on large-scale robotic datasets. Despite their strong empirical performance in robotic manipulation, VLAs lack mechanisms to quantify confidence in their predictions and to detect when their actions may be unreliable. This presents a critical limitation for real-world deployment in non-stationary environments, where models inevitably encounter scenarios outside their pretraining distribution and may fail without warning. To address this, we derive an efficient method for quantifying epistemic uncertainty in flow-matching models by leveraging velocity-field disagreement (VFD) across a small ensemble. We successfully use this uncertainty estimate for failure detection during deployment and active fine-tuning of flow-based VLAs. To this end, we propose SAVE, a framework for uncertainty-guided active multitask fine-tuning that reduces the number of costly expert demonstrations required to adapt VLAs to new tasks. Through extensive experiments on the LIBERO benchmark, we demonstrate that VFD yields better-calibrated uncertainty estimates predictive of downstream performance, that VFD achieves strong performance in detecting failures, and that uncertainty-guided data acquisition with SAVE requires at least 22% fewer samples than baselines. In summary, our work shows that quantifying epistemic uncertainty in flow-based VLAs improves both failure awareness and adaptation. Project website: tum-lsy.github.io/uq_vla/.
comment: Project page: tum-lsy.github.io/uq_vla/. 28 pages, 12 figures
LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation
Diffusion-based visuomotor policies deployed with asynchronous inference often exhibit inter-chunk discontinuities and lack explicit mechanisms for obstacle-aware execution, leading to jerky motions and collisions that hinder reliable manipulation in real-world scenes. To address these issues, we propose LAGO Policy, a unified asynchronous action-generation framework that integrates trajectory optimization with diffusion policy for smooth and safe execution. LAGO Policy improves inter-chunk consistency via latency-aware classifier-free guidance conditioning on future actions. It further enables goal-directed collision-free trajectory planning by predicting a task-relevant interaction goal from demonstrations. Finally, spatial-temporal trajectory optimization refines the actions to be executed for low-jerk and feasible motion. Extensive real-world experiments demonstrate that LAGO Policy achieves smooth collision-free execution with high task success across challenging manipulation tasks. Project Website: https://lago-policy.github.io/
comment: 8 pages, 8 figures
ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation
Most Vision-Language-Action (VLA) models map observations directly to actions without explicit reasoning, limiting their capacity for reasoning-intensive long-horizon tasks. To address this, existing approaches adopt Chain-of-Thought (CoT) reasoning to enable subgoal decomposition and spatial anticipation. However, those methods lack a unified architecture for effective cross-modal reasoning and fail to explicitly include inverse reasoning ability based on the target state. We argue that manipulation planning naturally decomposes into prediction, anticipating the next visual state, and inverse dynamics, inferring the actions to reach it. Bridging both requires a unified autoregressive architecture that interleaves textual and visual reasoning in a single generation process. We propose \textbf{ThinkingVLA}, a generative model that realizes this decomposition within a unified Mixture-of-Transformers architecture. ThinkingVLA consists of a forward CoT that identifies the immediate subgoal and guides the visual forecasting; the predicted image then serves as the target state, grounding an inverse CoT that reasons about spatial relationships and action intent based on the predicted image; and the final action is generated conditioned on this full reasoning context. Extensive experiments on simulation and real-world benchmarks demonstrate that ThinkingVLA consistently outperforms state-of-the-art baselines, with particularly large gains on long-horizon manipulation tasks.
SPARK: Low Latency Single-Camera 3D Pose Estimation for Autonomous Racing using Keypoints SC 2026
In autonomous racing, fast detection of other participants' movements is required to plan safe, collision-free trajectories with non-cooperative opponents. LiDAR detection is inherently slower and harder to deploy on edge devices than vision methods, causing delayed detections that limit object tracking performance during high-dynamic maneuvering. Utilizing monocular 3D detection enables an easy-to-deploy, low-latency detection of other participants on the racetrack. We present SPARK, a single-camera pose-estimation algorithm for autonomous racing using keypoint detection. It achieves long-range detection with high accuracy, exceeding the performance of state-of-the-art monocular camera detection algorithms while maintaining lower latency. By employing well-optimized YOLO models and leveraging the fixed geometry in the autonomous racing domain, the algorithm also exhibits low latency and resource usage. We evaluate the performance of our approach on real-world autonomous racing data and compare it to state-of-the-art LiDAR and camera detection algorithms. The source code is available at: https://github.com/TUMFTM/SPARK-camera-det
comment: 9 pages, 6 figures, ITSC 2026, Invited Session
PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space
Current Vision-Language-Action (VLA) models face a trade-off between efficient action generation and explicit deliberation. Directly decoding actions from vision-language backbone representations enables low-latency control, whereas explicit reasoning through textual chains, pixel-level subgoals, or action search can improve planning but incurs substantial latency and computational cost. We propose PearlVLA, a VLA framework that moves deliberation into the latent space of a vision-language model (VLM). PearlVLA separates VLM meta-query representations into a fixed visual grounding branch and an iterative latent plan branch. At each refinement round, a plan-conditioned world query probes a lightweight frozen latent world model for an action-free future observation latent, which is fed back to guide plan refinement. A future-guided RefineNet then applies scheduled residual updates to progressively refine a coarse semantic draft into a fine-grained latent action plan. The refined plan after K rounds is then decoded in parallel into an action chunk for low-latency execution. We further introduce Causal Refinement-Grouped Process-Reward RL to optimize the latent refinement process with rewards from longer-horizon imagined futures induced by latent plan edits. Empirical evaluations on the LIBERO benchmark demonstrate that PearlVLA achieves state-of-the-art performance among existing methods.
comment: 21 pages, 2 figures. Preprint
WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT
Recent World-Action (WA) models demonstrate strong generalization ability and data efficiency, but they typically rely on expert trajectories for training. This reliance limits their ability to acquire fine-grained manipulation skills beyond the demonstration distribution and prevents them from continuously improving through real-world interaction. To address these limitations, we propose WAM-RL, a reinforcement learning framework that enables joint optimization of the world model and the action model through online interaction with the environment. By allowing the two components to co-evolve, our approach enhances fine-grained control and adaptability. Specifically, a WA model consists of a world model and an actor. We design a tailored reinforcement learning method with hierarchical optimization to coordinate their improvement. On the methodological side, we systematically investigate the effects of applying reinforcement learning to the action model, as well as online training of the world model within an RL setting. Our experiments reveal a key insight: optimizing only the actor yields improvements on short-horizon tasks, but fails to provide significant gains on long-horizon tasks. In contrast, jointly optimizing both the world model and the actor is critical for achieving strong performance in long-horizon settings. Our work is the first to introduce reinforcement learning into the World-Action paradigm, and provides insights into how online optimization of both the action head and the world model impacts overall performance.
Learn to Quantify Social Interaction with Constraints for Pedestrian Walking
Long-term human path forecasting in crowds is critical for autonomous moving platforms (like autonomous driving cars and social robots) to avoid collision and make high-quality planning. Although the current research take into account social interactions for prediction, they don't reveal the exact kinds of social interactions happened among people and how the social interactions affect the decision-making process of pedestrians, which further limits its robustness. Social interactions in pedestrian walking are intuitively massive and hard to label and quantify. In this paper, we explore creatively to quantify and interpret how pedestrians interact with others by proposing Learn to Cluster. Our clustering social interactions is probabilistic latent variable generative, learning directly from sequential trajectory observations, scalable to arbitrary number of pedestrians. Learn to cluster is label-free and can be naturally integrated into the training process of the prediction model. The latent variables will then serve as 'labels' to categorize social interactions. Extensive experiments over several trajectory prediction benchmarks demonstrate that our method is able to learn the patterns of social interactions and effectively integrate the patterns to pedestrian trajectory prediction.
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $π$0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.
comment: 44 pages
From Ad Hoc Pilots to Repeatable Patterns: Structuring Drone Collaboration in Emergency Services with DroneLets
Drones hold promise for supporting emergency services, but their integration into workflows remains ad hoc and coordination-intensive. This paper addresses two research questions: how emergency teams want to collaborate with drones, and how to formalize these collaborations into repeatable processes. Based on four field trials and 95 interviews, we derive 44 interaction patterns grouped into 10 meta-patterns reflecting operational needs such as reconnaissance, communication, and logistical support. To structure these practices, we introduce DroneLets - a new class of design artifacts that extend Collaboration Engineering to embodied agents. DroneLets capture setup requirements, drone capabilities, environmental constraints, and coordinated actions across human and drone actors. They offer a modular framework for designing repeatable, scalable collaboration processes in emergency services, illustrated through patterns such as broadcasting to bystanders and post-fire monitoring. This work expands the scope of CE and provides a structured foundation for integrating autonomous drones into high-stakes field operations.
comment: Presented at International Conference on Information Systems (ICIS) 2025: https://aisel.aisnet.org/icis2025/is_transformwork/is_transformwork/19/
HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning
Humanoid robots promise whole-body interaction in human-centered environments, but scalable policy learning remains difficult because task-level decision-making and whole-body dynamic execution are tightly coupled. A practical solution is hierarchical control, where a high-level policy predicts intermediate whole-body actions and low-level general motion trackers (GMTs) execute them as stable humanoid motion. However, existing benchmarks rarely evaluate the policy-tracker interface itself, leaving open whether intermediate whole-body actions are executable, robust under task distribution shifts, and transferable across different GMT backends. We introduce HumanoidArena, a simulation-first benchmark for egocentric hierarchical whole-body learning. The benchmark formulates policy learning as a hierarchical decision making problem: a high-level policy converts egocentric vision, proprioception, and instructions into a compact whole-body action, which is subsequently executed by a low-level GMT. Instead of treating the legs as planar transport tools, HumanoidArena emphasizes interactions where lower-body coordination is structurally necessary in task completion. We therefore design 7 leg-critical HOI/HSI tasks in which success requires foot placement, balance maintenance, posture adjustment, and whole-body reorientation. To further diagnose the hierarchical system, we evaluate policies from two complementary perspectives: perturbation-conditioned generalization and GMT-conditioned transfer. Experiments show that hierarchical control enables learned policies to solve diverse leg-critical interactions, but performance is strongly tracker-conditioned and cross-GMT transfer remains fragile. These results position HumanoidArena as a benchmark for studying transferable intermediate action representations and scalable egocentric whole-body policy learning.
comment: 29 pages, 13 figures, 10 tables
Accountability in Autonomous Drone-Based Firefighting: Insights From a Field Trial
There is a growing research field exploring how autonomous drones can enhance emergency response effectiveness. Integrating these (artificial) agents into existing emergency teams and workflows may significantly impact established accountability relationships. This paper examines how autonomous drones affect accountability attribution within complex socio-technical systems. Drawing on two real-life field trials in firefighting, the study reveals substantial uncertainty around accountability when drones are organizationally deployed. Using Bovens' accountability framework, two challenges are identified: (1) uncertainty about the role of drones within hierarchical structures, leading to confused accountability ascriptions; and (2) new forms of human-drone interactions introducing additional accountability-relevant issues. Based on these insights, the paper proposes actionable recommendations to support the responsible integration of autonomous drones into firefighting operations without undermining accountability. These findings offer practical guidance for policymakers and contribute to further research on accountability in autonomous systems.
comment: Accepted for Publication at International Conference on Information Systems (ICIS) 2025: https://aisel.aisnet.org/icis2025/ethical_is/ethical_is/10/
ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents
Robotics are expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides upon the robot's motion, while the robot senses the environment and decides where to execute the wildfire detection (onboard or remotely) and how. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. Especially in the most demanding missions, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.
comment: 14 pages, 9 figures
ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI NeurIPS
Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available https://huggingface.co/datasets/huggingdas/erqa-plus and the project page at https://github.com/LUNAProject22/erqa-plus.
comment: under review at NeurIPS
FLAP: FOV-Constrained Active Perception Planning for Prior-Map-Free 3D Navigation
Safe and efficient trajectory planning in unknown, cluttered 3D environments constitutes a critical bottleneck for deploying Unmanned Aerial Vehicles (UAVs) in real-world applications. This challenge is further exacerbated by the limited field-of-view (FOV) and sensing range of onboard sensors. Many existing methods either make simplistic assumptions about unexplored space or rely on conservative heuristics such as speed limits or fixed perception patterns, reducing efficiency and generalizing poorly across different sensor types. In this work, we propose a novel planning framework that directly integrates active perception into trajectory optimization, thereby improving safety while preserving efficiency. The perception constraints are derived from the UAV's dynamic model and formulated in the sensor coordinate frame, which enables precise handling of FOV geometry. The velocity-triggered activation mechanism enables the planner to balance perception and motion efficiency. We introduce an active perception sub-trajectory segment with parametric start-time optimization, mitigating collision risks from late obstacle detection. Our formulation enables active perception during arbitrary 3D maneuvers, extending beyond prior methods designed mainly for horizontal motion. All constraints and penalties are incorporated into a differentiable optimization problem, so the planner requires only a simple front-end global path for guidance, rather than a computationally expensive perception-aware path generator. Extensive simulations and real-world experiments demonstrate robust performance across diverse unknown environments with varying sensor configurations.
comment: 18 pages, 19 figures
MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation
Humans naturally leverage diverse sensing modalities to interact with the physical world, while most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-demand tools for robotic manipulation. Given a task instruction and visual context, MuseVLA first generates a sensor token and target description that select the sensing modality to invoke and what to attend to, analogous to a tool call with arguments. It then converts the selected sensor measurement into a grounded sensor image, a unified intermediate representation that encodes heterogeneous readings for multimodal fusion and action generation. This design decouples sensor-specific processing from the VLA backbone, enabling efficient integration of diverse modalities. To reduce the need for expensive multisensory robot datasets, we further introduce a data synthesis pipeline that augments existing RGB video datasets with grounded sensor images, enabling generalization to unseen sensor-guided tasks. We evaluate MuseVLA on a real-world robot across challenging dexterous hand manipulation tasks that require multimodal sensing inputs, including temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden object retrieval. MuseVLA achieves 80.6% success rate on average, outperforming RGB-only and multisensory VLA baselines significantly, and exhibits strong zero-shot capabilities on unseen tasks.
RICH-SLAM: Radar SLAM with Incremental and Continuous Hilbert Mapping
Simultaneous localization and mapping using radar sensors has gained increasing attention due to radar's inherent robustness to adverse weather and lighting conditions. However, radar measurements are characteristically sparse and noisy compared to LiDAR and visual data, posing significant challenges in achieving dense, continuous, and consistent map representations. In this paper, we present RICH-SLAM, a radar SLAM framework designed to address these challenges. Our approach features a Rao-Blackwellized particle filter-based back end that employs particle filtering for pose estimation and Kalman filtering for map updates. We propose an incremental Hilbert-space reduced-rank Gaussian process mapping strategy that enables continuous and uncertainty-aware map representations given sparse radar inputs. We further introduce a posterior-aware particle weighting scheme that leverages the full posterior distribution of map parameters for more robust likelihood evaluation. Experiments on self-collected and public ColoRadar datasets show that RICH-SLAM constructs continuous occupancy maps from sparse radar measurements and supports uncertainty-aware planning for mobile robots.
comment: 12 figures
GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments
Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10\% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintains a performance gap of less than 10\% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.
MagicSim: A Unified Infrastructure for Executable Embodied Interaction
Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomicskills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.
When Robots Sleep: Offline Skill Consolidation for Shared-Policy Robot Learning
Robots that learn over long deployments must add new skills without losing the shared policy structure that makes earlier skills reusable. We study sequential robot skill learning, where previous trajectories and task losses may be unavailable, and the deployed policy must remain a single shared controller without task-specific heads, routing, or adapters. We identify skill-coupling collapse, a failure mode in which individual skill success remains non-trivial while reliability among related skills deteriorates. We propose Sleeping Robots, a wake-sleep framework that learns each new skill during wake and consolidates the shared policy offline during sleep using compact frozen skill memories: frozen critics with unordered state buffers for reinforcement learning and frozen actor snapshots with unordered observation buffers for imitation learning. During sleep, these memories define differentiable surrogate objectives whose gradients are combined through Nash bargaining, with adaptive anchoring and local excitability for stable consolidation. On Meta-World MT5, Sleeping Robots improves average success by 64 % and pairwise reliability by x 2.0 over the strongest non-oracle baseline, and on SurgicAI it improves average success and backward transfer relative to continual imitation baselines while remaining competitive on pairwise reliability.
GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning
Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.
WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation
Vision-Language-Action (VLA) policies have achieved remarkable single-step manipulation, yet they remain brittle precisely where each stage depends on what was just completed. The core issue is structural: short-window VLAs lack an explicit channel for rouxting information across sub-task boundaries, and existing memory-augmented variants either write at every frame, retrieve from demonstration-time stages, or fire at sub-goal events without performing an explicit sub-task-to-sub-task hand-off into the action expert. We identify the sub-goal completion event as the natural temporal unit for cross-subtask memory hand-off, and present WeaveLA (Weave Latent memory for Vision-Language-Action policies), a cross-subtask memory interface that, on top of a frozen VLA backbone, compresses each completed segment into latent tokens via query-driven attention pooling and routes them directly into the action-generation path of the next sub-task. This event-triggered, action-side design preserves the base policy's short-window interface while adding a lightweight cross-subtask channel. Through stratified evaluation on RoboMME with a $π_{0.5}$ backbone, WeaveLA's gains land exactly where the channel is needed: on the hardest repetition slice (SwingXtimes, $N{=}3$), success rises from $0\%$ to $47.8\%$, while single-execution episodes remain unchanged. Per-episode paired analysis confirms the gains are confined to tasks whose causal structure requires cross-subtask information.
Embodiment Shapes Rolling Behavior in a Multimodal Infant Model
Rolling over is one of the earliest milestones in infant motor development, reflecting the emergence of coordinated, whole-body sensorimotor control. Here, we conduct a computational study of infant rolling using MIMo, a virtual infant embodiment equipped with proprioception and vestibular sensation. MIMo learns supine-to-prone rolls with reinforcement learning. Interestingly, the learned behaviors capture developmental trends and coordination patterns consistent with those reported in real infants, including improved performance and faster execution with age. Our results explain how infant capabilities and constraints can give rise to realistic behaviors in artificial agents, with a particular emphasis on how motor development is shaped by the changing body morphology. This work highlights the role of embodied computational models as a powerful tool for studying sensorimotor development.
comment: 7 pages, 7 figures. Accepted at the 2026 IEEE ICDL Conference. Cite as: L. Philipp, F. M. López, and J. Triesch, "Embodiment Shapes Rolling Behavior in a Multimodal Infant Model", in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-7
Continual Online Personalization of Exoskeleton Control via Manifold-Aware Experience Replay
Personalizing exoskeleton control remains a critical challenge for clinical users with gait disabilities. Online adaptation (OA) offers an effective solution by adapting in real time to subject variability, device fit, and diverse locomotor tasks. However, OA involves a continual stream of user state data, which can lead to catastrophic forgetting of previously learned locomotor contexts. Here, we develop a manifold-aware experience replay-based online personalization framework designed to maintain user-specific representations across diverse tasks during OA of exoskeleton control. By replaying previously experienced tasks from a replay buffer, we preserve the personalized exoskeleton assistance across all learned tasks. Furthermore, we capture a gait manifold that distinguishes between different locomotor tasks, removing the need for explicit task labeling when selecting target replay bins. We evaluated our framework on emulated hemiplegic gait, which largely deviates from able-bodied patterns, across multiple forgetting scenarios with speed and incline transitions. Our manifold-aware replay framework achieved 40% and 60% improvements in torque and gait phase tracking accuracy, respectively, compared to a baseline framework without replay, which exhibited catastrophic forgetting during task transitions. This demonstrates that our proposed framework personalizes exoskeleton control in real time across diverse locomotor contexts in daily ambulation of clinical populations.
Credibility-Weighted Pricing of Autonomous Vehicle Liability Under Operational Design Domain Shift
Automated Driving System deployments create a foundational ratemaking challenge: sparse experience, shifting operational design domains, and non-stationary risk across software releases. We propose a hierarchical Bayesian credibility framework pooling across cities, software versions, and territories via a learned ODD-similarity kernel, nesting Buhlmann-Straub as a limiting case. Demonstrated on 648 verified-engaged Waymo crashes across four U.S. metros from the NHTSA Standing General Order database against 116 million matched miles, city-aggregate credibility weights are moderate (0.12-0.46), partial pooling decisively outperforms no pooling, and a power analysis shows the learned kernel's advantage becomes detectable at approximately twelve deployed cities.
AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation
Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified visual-language annotation pipeline using vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset's geometry and physical constraints through candidate generation, geometry optimization and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in supplementary materials Project page: https://tourmaline-caramel-169490.netlify.app.
DexLink Hand: A Compact, Affordable, 16-DOF Linkage-Driven Hand with Human-Like Dexterity
Dexterous robotic hands face a longstanding trade-off among dexterity, compactness, and affordability. Particularly, high-degree-of-freedom designs typically demand complex actuation and transmission, hindering integration into human-scale forms. To address these challenges, this work presents a compact, low-cost linkage-driven anthropomorphic hand that achieves high dexterity, structural integration, and human-hand-like functionality. The hand integrates 20 joints driven by 16 independent actuators, with all actuation, sensing, and transmission components compactly embedded within a human-hand-sized structure. The resulting prototype weighs only 320g at a total cost below USD 400. To meet these objectives, a hybrid mechanical architecture combining planar and spatial linkage mechanisms is proposed, enabling decoupled multidirectional motion, biomimetic joint synergies, and high passive load-bearing capability. The thumb further incorporates biomimetic features supporting human-like reconfiguration and opposition movements. Through the coordinated integration of these mechanisms and structural layout, the prototype achieves a highly integrated design with anthropomorphic dexterity. Experimental evaluations demonstrate that the hand achieves the maximum Kapandji score, reproduces all 33 Feix grasp types, and performs stable grasping and dexterous manipulation across a wide variety of daily objects and tools. These results validate the proposed hand as an affordable, compact, and mechanically efficient platform for dexterous manipulation, teleoperation, and robot learning in human-centered environments.
Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies
Generative robot policies typically begin action generation from an observation-independent standard Gaussian distribution, leaving the choice of source distribution underexplored. This work asks a simple question: where should action generation begin? We propose LeaP, a Learnable source Prior that replaces the standard Gaussian with a proprioception-conditioned diagonal Gaussian over action chunks. Parameterized by a lightweight MLP, LeaP jointly predicts the mean and state-adaptive variance of the source distribution, while keeping the downstream generator architecture and inference solver unchanged. This design provides an observation-informed yet stochastic initialization, allowing the generator to focus on precise action refinement rather than transporting samples from an uninformed noise source. On 15 RoboTwin manipulation tasks, LeaP achieves an average success rate of 81.6%, outperforming four representative baselines -- including deterministic-source methods, a no-prior counterpart, and a diffusion-bridge policy -- by 6.5 to 25.5 percentage points. The same prior consistently improves both flow-matching and diffusion-bridge generators, while using fewer parameters and converging faster. The advantage carries over to real-world deployment, where LeaP attains the best performance. These results suggest that the source distribution is an independent and reusable design axis for generative robot policies, complementary to the choice of generative dynamics.
Damage Adaptation in Seconds for Architected Materials
Adaptation to damages and in-situ physical repairs is essential for long-term robot autonomy, yet challenging outside of narrowly defined and well-anticipated bounds. In this work we proprioceptively adapt to catastrophic damage in soft-actuated systems in under one minute. Architected materials are well equipped for adaptation: actuator failure occurs gradually rather than acutely, and damage can be described in a low-dimensional, discrete coordinate space. Surprisingly, latent damage representations plus a simple yet robust ensemble method is sufficient for adapting to unseen damage in real-time. Moreover, we identify conditions under which exponential sample complexity collapses to linear sample complexity for learned representations of architected materials, a concrete advantage over rigid components or continuum soft mechanisms. We demonstrate LEAP, our method for adaptive proprioception, via a tracing task for a 6DoF soft wrist based on Handed Shearing Auxetic (HSA) actuators. Our algorithm is able to adapt to cuts, burns, and actuator repairs, enabling simulation-free real-time adaptation that is critical for realizing the promise of soft robots outside the lab. Videos and more information are available at https://murpheylab.github.io/leap.
comment: Proceedings of Robotics: Science and Systems
Agent Utilities over Generalized Voronoi Regions and their Gradients
In this paper, we generalize the concept of Voronoi regions, define agent utility as the integral of a utility density over the corresponding Voronoi region, derive gradients of the utility, and illustrate the approach in a two-team example from soccer. The generalization of Voronoi regions is in the form of so-called Cost-Induced Voronoi (CIV) regions, where the agent state space may differ from the space being partitioned. One example of such regions is when the cost is given by the optimal solution of an LQR control problem. Then the agent states include position as well as velocity, while the partitioned space only includes positions. The agent utility is defined by integrating some utility density over the CIV region of the agent. This utility density might be the probability density of some beneficial event, such as receiving a pass in soccer. The utility is then the overall probability of receiving a pass and the gradient represents a way to improve that probability. We show how this utility gradient can be computed using the Reynolds Transport Theorem from fluid mechanics, and that this approach achieves similar accuracy while reducing computation time by about an order of magnitude compared to a baseline finite-difference approximation.
comment: Under review at IEEE Control Systems Letters (L-CSS)
TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations
End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.
EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning
Internet videos constitute the largest reservoir of embodied human manipulation knowledge, yet converting arbitrary RGB footage into actionable robot training data remains a major bottleneck. Existing lab- or factory-collected datasets are narrow in scale and diversity, limiting open-world robot learning. Instead of proposing a static dataset, we introduce EgoInfinity, a universal 4D hand-object interaction data engine that enables web-scale data generation for robot retargeting and learning. EgoInfinity is a modular engine integrating perception, segmentation, reconstruction, interaction-aware refinement, and retargeting to automate this traditionally unscalable video-to-action problem without human-in-the-loop annotation. Its modular design lets the engine continuously benefit from advances in any incorporated component. With EgoInfinity, in-the-wild human manipulation videos are lifted into agent-agnostic, metric 4D hand-object representations, including hand trajectories, 6-DoF object poses, and contact-relevant states. Rather than naively connecting standalone components, EgoInfinity combines cross-module metric calibration with interaction-aware refinement to improve physical reliability, reducing drift and contact inconsistencies common in pure visual reconstruction. We further propose a novel motion retargeter that compiles the recovered 3D hand motions into executable joint trajectories for diverse robot morphologies, enabling video-to-action retargeting on any robot from arbitrary viewpoints and shot sizes (e.g., the human body is only partially visible). We validate EgoInfinity across perception fidelity, kinematic feasibility, contact consistency, cross-embodiment generalization, and real-robot skill acquisition (e.g., grasping, cutting, wiping, and pouring), demonstrating a scalable bridge from internet videos to executable robot behavior for open-world robot learning.
comment: 24 pages. Project page: https://huggingface.co/spaces/Rice-RobotPI-Lab/EgoInfinity
Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework
Respiratory-rate (RR) monitoring is a critical component of remote triage and victim assessment in emergency response, disaster recovery, and infectious-disease scenarios, where minimizing physical contact can reduce responder risk and improve operational safety. However, field deployment of contactless RR monitoring remains challenging due to variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments. In this paper, we present a modality-adaptive contactless RR monitoring framework for heterogeneous mobile robots with onboard edge computing. The proposed system combines brightness-adaptive sensor selection across RGB, thermal, near-infrared (NIR), and low-light cameras, keypoint-guided chest ROI extraction for posture-robust monitoring, and a signal-quality-index (SQI)-based filtering mechanism for reliable respiratory estimation. We implement and evaluate the framework on three robotic platforms spanning quadruped and wheeled locomotion and multiple edge-computing architectures. Experiments conducted across diverse lighting conditions, subject poses, and robot-to-subject distances demonstrate that the framework generalizes across platforms without per-platform algorithmic retuning, while revealing modality-specific operational boundaries. RGB provides the broadest coverage up to 8m, NIR remains effective up to 6m, thermal is reliable only at short range, and low-light sensing supports monitoring in complete darkness up to 8m. Overall, the results demonstrate the feasibility of multimodal contactless RR monitoring on mobile robots and support its use as a foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.
comment: 8 pages, 6 figures. To appear in Proceedings of the 8th International Workshop on IoT Applications and Industry 5.0 (IoTI5 2026), co-located with IEEE DCOSS-IoT 2026, Reykjavik, Iceland, June 2026
AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework
AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the system under test may sense, decide, actuate, communicate, and fail through physical processes, networked devices, and human operators. This article develops an assurance-oriented account of AI sandboxes as controlled environments for testing, evaluation, verification, and validation across digital AI, embodied autonomy, and cyber-physical deployments. We formalize the sandbox boundary and a weakest-link rule for composing per-dimension evidence into a bounded deployment claim; separate major sandbox archetypes; define a cyber-physical threat model that includes attacks on the assurance apparatus itself; and introduce a measurement framework spanning fidelity, controllability, observability, containment, reproducibility, and governance artifacts, instantiated on three worked case studies of real sandboxes. The resulting threat model, taxonomy, and measurement framework clarify what a sandbox can validly test, which risks it can contain, and what forms of evidence it can support for safety, security, and regulatory assurance.
comment: 50 pages, 8 figures, 10 tables
As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture
Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.
Task Allocation and Motion Planning in Dynamic, Cluttered Environments via CBBA and Graphs of Convex Sets
Multi-agent task planning in cluttered, dynamic environments requires assigning tasks to agents while simultaneously determining safe, time-efficient trajectories through the environment. When tasks are dynamic, such as rendezvous objectives, allocation decisions depend not only on which agent is best suited for a task, but also on when and where that task can be reached. This paper presents a solution to this problem, which combines Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus-Based Bundle Algorithm (CBBA) for distributed task allocation. In our approach, GCS finds optimal trajectories through dynamic environments using a time-extended (3D+time) configuration space. At the same time, CBBA coordinates task assignments across agents, enabling informed decision-making in a moving environment. We then connect allocation and planning to allow the agents to avoid collisions in the 3D+time configuration space and provide accurate time estimates for task completion. We demonstrate the effectiveness of our approach in simulated cluttered environments with static and dynamic tasks.
comment: 15 pages single column, 10 figures, AIAA-Scitech 2027 Submission
N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering
Neural combinatorial optimization (NCO) offers a promising alternative to traditional heuristic-based methods for solving complex graph optimization problems by proposing to learn heuristics through data. This class of problems frequently arises in automation, as it can be used to model a variety of applications. While NCO has been extensively studied for deterministic combinatorial optimization problems, there are only a few works that aim to solve stochastic combinatorial optimization problems. In this work, we present N(CO)$^2$: Neural Combinatorial Optimization with Chance cOnstraints to solve the Stochastic Orienteering Problem (SOP) without the use of hand-crafted heuristics. By integrating a reinforcement learning (RL) framework, the model optimizes path selection under uncertainty, effectively balancing exploration and exploitation. Empirical results demonstrate that our method generalizes well across diverse SOP instances, achieving competitive performance compared to the state-of-the-art mixed-integer linear program (MILP) for the task. The proposed approach reduces human effort in heuristic design while enabling adaptive and efficient decision-making in uncertain environments.
RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer
Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.
comment: 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author
VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision
We introduce VEGA, an approach for training navigation VisionLanguage-Action (VLA) models from unlabeled egocentric navigation videos. Internet-scale egocentric videos provide a scalable source of navigation-relevant visual observations, capturing cluttered scenes, close-range obstacles, and natural human motion through real-world spaces. However, these videos are not directly usable for policy learning because they do not provide obstacle-aware trajectories conditioned on explicit navigation goals in the robot's coordinate frame. VEGA addresses this gap by reconstructing local scene geometry from monocular video, sampling navigation goals (represented as text, image, or spatial waypoints) and generating obstacle-aware trajectories using the constructed geometry. The resulting trajectory distribution is then used to train a flow-matching VLA navigation policy. By using geometry exclusively during training, VEGA distills obstacle-aware planning directly into a vision-based policy. Furthermore, we introduce VEGA-Bench, a benchmark containing 250k scenes and approximately 5 million navigation goals paired with scene geometry, designed to evaluate goal progress, collision avoidance, and obstacle clearance of VLAs. Our evaluation shows that VEGA achieves competitive goal progress while reducing collisions by 33.0% and improving obstacle clearance by 17.9% over the strongest baseline on VEGABench, while improving success by at least 150.0%, reducing collisions by at least 66.7%, and improving obstacle clearance by at least 60.0% in real-world trials. Ultimately, we demonstrate that video-derived geometric supervision provides a scalable and effective signal for training obstacle-aware navigation VLAs. The code and benchmark will be released at the time of publication.
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation
World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
Guava: An Effective and Universal Harness for Embodied Manipulation
Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tools use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.
Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures
Intelligent robots should not only recover from failures, but also acquire the abstract knowledge needed to avoid them in the future. While reinforcement learning (RL) can learn reactive recovery behaviors, training a separate policy for every distinct failure mode is highly inefficient. We introduce Recovery-Driven Synthesis of Relational Concepts (ReSYNC), the first approach that progressively discovers and refines state abstractions (relational predicates) from failure-recovery experience to support abstract planning. Unlike purely reactive methods, ReSYNC jointly learns skills and concepts through an incremental dual-learning process. In the skill-learning phase, the robot uses RL to learn to recover from failures seen in training tasks. In the concept-learning phase, the robot discovers new relational predicates and refines its abstract planning model to explain and generalize the learned recovery behaviors. This interaction enables ReSYNC to convert local recoveries seen during training into global failure avoidance at test time. Across four simulated domains, we show that ReSYNC's ability to continually expand and refine its abstraction library allows it to solve long-horizon, previously unseen problems, outperforming strong baselines by over 50%. Additionally, we demonstrate sim-to-real transfer of ReSYNC, where it performs real-world non-prehensile manipulation skills and generalizes to unseen scenarios through abstract planning. Overall, ReSYNC represents a significant step toward robots that autonomously acquire abstractions for scalable, failure-aware planning in the physical world.
comment: 9 pages, 6 figures. Website: https://jaraxxus-me.github.io/ReSYNC/
A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature
Embodied agents that navigate cities rely on world models that predict how their surroundings will change as they move. But for navigation, what matters is not what the buildings look like; it is where the agent can go. Most world models nonetheless predict appearance, learning how a scene looks rather than the space an agent can move through. Those that do target geometry, such as bird's-eye-view occupancy grids, flatten the three-dimensional environment onto a ground plane, discarding the above-ground and multi-level structure that shapes real navigation. What is missing is a predictive target that captures the navigable geometry an agent actually traverses, without photometric entanglement and without collapsing the third dimension. Our key idea is to model the open volume between buildings, the negative space, encoded as a 3D isovist: a spherical visibility-depth map recording the distance to the nearest surface in every direction. We introduce an embodied world model that predicts the next isovist from a short history of past isovists and a movement action. The prediction is formulated as a depth residual so the decoder inherits sharp building edges, trained with self-rollout scheduled sampling to keep corrupted context on the geometry manifold, and equipped with a persistent latent bird's-eye-view spatial map for cross-path consistency. Our central finding is emergent and unexpected: a single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature, with city identity linearly decodable from its temporal latents far above single-frame baselines, so the signature lives in the learned dynamics rather than in appearance. The representation is lightweight, interpretable, and reproducible, offering a geometric substrate for spatial reasoning in embodied AI, robotics, and urban analysis, released with an open dataset and pipeline.
Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking
Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment.Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots.
EqCollide: Equivariant and Collision-Aware Deformable Objects Neural Simulator KDD 2026
Simulating collisions of deformable objects is a fundamental yet challenging task due to the complexity of modeling solid mechanics and multi-body interactions. Existing data-driven methods often suffer from lack of equivariance to physical symmetries, inadequate handling of collisions, and limited scalability. Here we introduce EqCollide, the first end-to-end equivariant neural fields simulator for deformable objects and their collisions. We propose an equivariant encoder to map object geometry and velocity into latent control points. A subsequent equivariant Graph Neural Network-based Neural Ordinary Differential Equation models the interactions among control points via collision-aware message passing. To reconstruct velocity fields, we query a neural field conditioned on control point features, enabling continuous and resolution-independent motion predictions. Experimental results on 2D and 3D scenarios show that EqCollide achieves accurate, stable, and scalable simulations across diverse object configurations. It achieves $24.34\%$ to $57.62\%$ lower rollout MSE, even compared with the best-performing baseline model. Furthermore, EqCollide could generalize to more colliding objects and extended temporal horizons, and stay robust to input transformed with group action. Code is available at: https://github.com/AI4Science-WestlakeU/EqCollide
comment: SIGKDD 2026 Oral AI4S Track. 20 pages, 16 figures
Unified Motion-Action Modeling for Heterogeneous Robot Learning
We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.
comment: https://uma-manipulation.github.io/
Physical Imitation Learning: Distilling Control Policies into Passive Elasticity
Due to brain-body co-evolution, animals' intrinsic body dynamics play a crucial role in their energy-efficient locomotion. Specifically, the control effort is shared between active muscles and passive body dynamics--a principle often referred to as Physical Intelligence. As a result, the body dynamics are part of the solution. In contrast, robot bodies are typically designed to be as simple as possible, but the active control often fights the intrinsic body dynamics, resulting in low energy-efficiency. We introduce Physical Imitation Learning (PIL), a novel approach that brings current robotics control closer to animals. PIL takes learned control policies obtained with Reinforcement Learning (RL) and systematically splits them up into an active and passive control contribution. The passive part can be then directly offloaded to passive Parallel Elastic Joints (PEJs). As a result, the active control contribution is significantly reduced, lowering the overall energy consumption. Furthermore, the policy can be trained via RL to leverage the PEJ assistance by generating gaits that are more readily emulated by the PEJs. This enables co-design of the active and passive control components, shifting a greater share of actuation effort to the PEJs. Here we demonstrate the potential of this approach in simulated quadrupeds. Our results show that the proposed approach can offload up to 95% of mechanical power to passive body dynamics on flat terrain and 13% on rough terrain. PIL thereby provides a generalisable route to task-specific Physical Intelligence applicable to a wide range of joint-based robot morphologies.
Simulating Infant First-Person Sensorimotor Experience via Motion Retargeting from Babies to Humanoids
Motion retargeting from humans to human-like artificial agents is becoming increasingly important as humanoid robots grow more capable. However, most existing approaches focus only on reproducing kinematics and ignore the rich sensorimotor experience associated with human movement. In this work, we present a framework for simulating the multimodal sensorimotor experiences of infants using physical and virtual humanoids. From a single video, our method reconstructs the infant's body configuration by extracting its skeletal structure and estimating the full 3D pose from each frame. Then we map the reconstructed motion onto several developmental platforms: the physical iCub robot and the virtual simulators pyCub, EMFANT and MIMo. Replaying the retargeted motions on these embodiments produces simulated multisensory streams including proprioception (joints and muscles), touch, and vision. For the best-matching embodiment, the retargeting achieves sub-centimeter accuracy and enables a rich multimodal analysis of infant development as well as enhanced automated annotation of behaviors. This framework provides a unique window into the infant's sensorimotor experience, offering new tools for robotics, developmental science, and early detection of neurodevelopmental disorders. The code is available at https://github.com/ctu-vras/motion-retargeting/.
comment: Accepted at IEEE ICDL 2026. 8 pages, 6 figures. Cite as: F. M. López, H. Kanazawa, O. Fiala, Y. Balashov, V. Marcel, L. Rustler, M. Lenz, D. Kim, Y. Kuniyoshi, J. Triesch, and M. Hoffmann, "Simulating infant first-person sensorimotor experience via motion retargeting from babies to humanoids'', in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-8
SSIL: Self-Supervised Imitation Learning for End-to-End Driving
In autonomous driving, the end-to-end (E2E) driving approach that predicts vehicle control signals directly from sensor data is rapidly gaining attention. To learn a safe E2E driving system, one needs an extensive amount of driving data and human intervention. Vehicle control data is constructed by many hours of human driving, and it is challenging to construct large vehicle control datasets. Often, publicly available driving datasets are collected with limited driving scenes, and collecting vehicle control data is only available by vehicle manufacturers. To address these challenges, this paper proposes the first self-supervised learning framework, Self-Supervised Imitation Learning (SSIL), for E2E driving. The proposed SSIL framework can learn vision-based E2E driving networks without using driving command data or a pre-trained model. To construct pseudo steering angle data, proposed SSIL predicts a pseudo target from the vehicle's poses at the current and previous time points that are estimated with light detection and ranging sensors. In addition, we propose a new cross-attention-based conditioning approach (CACA) for a vision encoder in E2E driving, where a high-level instruction serves as the conditioning signal for visual information. Our numerical experiments with three different benchmark datasets demonstrate that the proposed SSIL framework achieves very comparable E2E driving accuracy with the supervised learning counterpart. Furthermore, the proposed pseudo-label predictor outperformed an existing one using proportional integral derivative controller, and proposed CACA achieved superior performance over existing conditioning approaches.
comment: 8 pages, 4 figures
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision--language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM \emph{thinker} branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
comment: 10 pages, 5 figures
CADET: Physics-Grounded Causal Auditing and Training-Free Deconfounding of End-to-End Driving Planners
End-to-end (E2E) autonomous-driving planners trained by imitation are prone to statistical shortcuts: they associate scene elements that merely co-occur with expert actions (a roadside object, a building facade) with driving decisions, rather than the variables that causally determine them. Such causal confusion silently compromises reliability in long-tail scenarios, and it is difficult to detect, because prevailing open-loop metrics (L2 displacement and collision rate) are dominated by ego status and do not indicate whether a planner depends on spurious cues. Existing remedies based on causal-intervention training require retraining large models and cannot audit a planner that is already deployed. We present CADET, a training-free framework that audits, benchmarks, and repairs spurious reliance in pretrained E2E planners without any parameter update.
comment: 8pages 4figures
SCC-Loc: A Unified Semantic Cascade Consensus Framework for UAV Thermal Geo-Localization
Cross-modal Thermal Geo-localization (TG) provides a robust, all-weather solution for Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, profound thermal-visible modality gaps introduce severe feature ambiguity, systematically corrupting conventional coarse-to-fine registration. To dismantle this bottleneck, we propose SCC-Loc, a unified Semantic-Cascade-Consensus localization framework. By sharing a single DINOv2 backbone across global retrieval and MINIMA$_{\text{RoMa}}$ matching, it minimizes memory footprint and achieves zero-shot, highly accurate absolute position estimation. Specifically, we tackle modality ambiguity by introducing three cohesive components. First, we design the Semantic-Guided Viewport Alignment (SGVA) module to adaptively optimize satellite crop regions, effectively correcting initial spatial deviations. Second, we develop the Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF) mechanism to explicitly enforce geometric consistency, thereby eradicating dense cross-modal outliers. Finally, we propose the Consensus-Driven Reliability-Aware Position Selection (CD-RAPS) strategy to derive the optimal solution through a synergy of physically constrained pose optimization. To address data scarcity, we construct Thermal-UAV, a comprehensive dataset providing 11,890 diverse thermal queries referenced against a large-scale satellite ortho-photo and corresponding spatially aligned Digital Surface Model (DSM). Extensive experiments demonstrate that SCC-Loc establishes a new state-of-the-art, suppressing the mean localization error to 9.37 m and providing a 7.6-fold accuracy improvement within a strict 5-m threshold over the strongest baseline. Code and dataset are available at https://github.com/FloralHercules/SCC-Loc.
comment: 17 pages, 5 figures. Submitted to IEEE J-STARS
K-VARK: Kernelized Variance-Aware Residual Kalman Filter for Sensorless Force Estimation in Collaborative Robots
Reliable estimation of contact forces is crucial for ensuring safe and precise interaction of robots with unstructured environments. However, accurate sensorless force estimation remains challenging due to inherent modeling errors and complex residual dynamics and friction. To address this challenge, in this paper, we propose K-VARK (Kernelized Variance-Aware Residual Kalman filter), a novel approach that integrates a kernelized, probabilistic model of joint residual torques into an adaptive Kalman filter framework. Through Kernelized Movement Primitives trained on optimized excitation trajectories, K-VARK captures both the predictive mean and input-dependent heteroscedastic variance of residual torques, reflecting data variability and distance-to-training effects. These statistics inform a variance-aware virtual measurement update by augmenting the measurement noise covariance, while the process noise covariance adapts online via variational Bayesian optimization to handle dynamic disturbances. Experimental validation on a 6-DoF collaborative manipulator demonstrates that K-VARK achieves over 20% reduction in RMSE compared to state-of-the-art sensorless force estimation methods, yielding robust and accurate external force/torque estimation suitable for advanced tasks such as polishing and assembly.
OpenTie: Open-vocabulary Sequential Rebar Tying System
Robotic practices on the construction site emerge as an attention-attracting manner owing to their capability of tackling complex challenges, especially in the rebar-involved scenarios. Most of existing products and research are mainly focused on the collection of large amounts of data with model training demands. To fulfill this gap, we propose OpenTie, a 3D training-free rebar tying framework utilizing a RGB-to-point-cloud generation and an open-vocabulary rebar detection on the real-world test. We implement the OpenTie via a robotic arm with a binocular camera and guarantee a high accuracy by applying the prompt-based object detection method on the image filtered by our proposed post-processing procedure for the image-to-point-cloud generation framework. Our pipeline requires no training efforts and outperforms the training-based object detection, i.e., YOLO-based method, with the verification on the real-world sequential rebar tying test. The system is flexible for horizontal and vertical rebar tying tasks and holds the potential application to the real construction site with possibility of commercialization.
comment: This article is accepted by The 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)
TRACE: Trajectory-Routed Causal Memory for Delayed-Evidence Visuomotor Imitation
Robots under autonomous operation may require decisions based on evidence that is no longer visible. We study delayed-evidence tasks, where an early cue disappears before a later decision point, so visually similar observations can require different actions. In these settings, the current observation is not a sufficient state for control. We introduce TRAjectory-routed Causal Evidence (TRACE), a memory framework for visuomotor imitation policies. TRACE stores task-relevant visual and robot-state evidence, such as object identity, target choice, or route-dependent state, in a fixed-size latent memory that remains bounded over long episodes. Instead of indexing memory by raw time or manually provided task labels, TRACE uses path signatures: compact, order-sensitive features of the executed robot-state trajectory. These signatures do not store the visual cue itself; rather, they provide trajectory-conditioned keys for writing and retrieving the evidence stored when the cue was visible. When the robot later reaches an ambiguous observation, the policy conditions on TRACE memory to recover the missing context and choose the correct branch. TRACE attaches through lightweight adapters to policies, without changing the policy backbone, action head, or imitation objective. Across real-world long-horizon manipulation tasks with visually ambiguous branch points, TRACE improves branch selection and task success over alternative baselines, including short-history and recurrent memory. Project page: https://jeong-zju.github.io/trace
DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation
Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation~(DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.
comment: 14 pages, 2 figures
SimTO: A two-stage, simulation-driven topology optimization framework for bespoke soft robotic grippers
Soft robotic grippers are essential for grasping delicate, geometrically complex objects in manufacturing, healthcare and agriculture. However, existing designs struggle to grasp feature-rich objects with high topological variability, including gears with sharp tooth profiles on automotive assembly lines, corals with fragile protrusions, or vegetables with irregular branching structures like broccoli. Unlike simple geometric primitives such as cubes or spheres, feature-rich objects lack a clear "optimal" contact surface, making them both difficult to grasp and susceptible to damage. Safe handling of such objects therefore requires specialized soft grippers whose morphology is tailored to the object's features. Topology optimization offers a promising approach for producing specialized grippers, but its utility is limited by the need for pre-defined load cases. For soft grippers, these loads arise from hundreds of unpredictable gripper-object contact forces during grasping and are unknown a priori. To address this problem, we introduce SimTO, a two-stage, simulation-driven topology optimization framework that automatically extracts load cases from a dynamic, contact-rich grasping simulation before performing classical topology optimization, eliminating the need for manual load specification. Given an arbitrary feature-rich object, SimTO produces highly customized soft grippers with fine-grained morphological features tailored to the object geometry. Physical experiments confirm that our specialized grippers achieve higher grasp forces than a generalist design produced by conventional topology optimization methods, while numerical experiments show that they achieve high grasp success rates across varying object poses and strong generalization to a set of unseen objects.
comment: 15 pages, 9 figures. Published in Structural and Multidisciplinary Optimization
AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving
Practical autonomous driving requires models that generalize by reasoning through spatial-temporal possibilities to exclude unsafe outcomes. While state-of-the-art (SOTA) methods use parallel planning architectures, they fail to explicitly couple speed decisions with agent behavior along the driving path, leading to suboptimal coordination. To address this, we propose a cascaded framework that transforms longitudinal planning from an independent prediction task into a path-conditioned reasoning process. On the model side, we introduce an anchor-based regression design that conditions longitudinal prediction on the lateral drive path, and reformulate longitudinal planning as 1D displacement prediction along the path. This reduces geometric uncertainty and sharpens the model's focus on interaction-driven dynamics. On the data side, we introduce a planning-oriented data augmentation strategy that simulates rare safety-critical events by programmatically inserting agents and relabeling longitudinal targets to enforce collision avoidance. Evaluated on the challenging Bench2Drive benchmark, our method achieves SOTA performance with a driving score of 89.07 and a success rate of 73.18%, demonstrating significantly improved coordination and safety. Further evaluation on Fail2Drive confirms strong generalization to rare edge cases where parallel formulations typically fail. Project page:https://yanhaowu.github.io/AlignDrive/.
comment: underreview
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
comment: Robotics: Science and Systems, 2026
Critique of World Model: A Generative Latent Prediction Architecture for World Modeling
World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of ``hypothetical thinking'' in psychology literature, we argue the primary goal of a world model to be {\it simulating all actionable possibilities of the real world for purposeful reasoning and acting}. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.
MimicIK: Real-Time Generative Inverse Kinematics from Teleoperation with FK Consistency
Inverse kinematics (IK) remains a critical bottleneck for real-time robot manipulation. Classical numerical solvers achieve high geometric precision but often suffer from discontinuous branch switching and unstable behavior near kinematic singularities during closed-loop deployment. Meanwhile, learned IK approaches frequently struggle to balance spatial accuracy, motion smoothness, and real-time efficiency, particularly when trained on noisy human teleoperation data. We present \textbf{MimicIK}, a real-time generative inverse kinematics framework that learns smooth and robust joint-space motion priors from teleoperation demonstrations through conditional flow matching. Given the current joint configuration and a target end-effector pose, MimicIK predicts continuous delta-joint commands using an efficient two-step iterative refinement process based on a Minimal Iterative Policy (MIP) backbone. To enforce physical consistency, we further introduce an FK consistency loss, a differentiable forward-kinematics regularization that penalizes task-space deviations from the target pose during training. We evaluate MimicIK on a real-world 6-DOF robot dataset containing 8,848 teleoperation demonstrations. MimicIK achieves a mean position error of 4.65 mm, a 10 mm success rate of 92.01\%, and a trajectory spike rate of only 7.99\%. Compared with a UNet diffusion baseline, our method improves both spatial accuracy and motion smoothness while reducing inference latency from 21.66 ms to 6.74 ms. Furthermore, unlike deterministic MLP baselines that catastrophically diverge under out-of-distribution deployment, MimicIK remains stable near singular configurations and enables robust 20 Hz real-time control on deployment hardware.
Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion
Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present \textbf{Phys4D}, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts \textbf{a three-stage training paradigm} that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of \textbf{4D world consistency evaluation} that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/
RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models
Vision-Language-Action models (VLA) have demonstrated remarkable capabilities and strong potential in complex robotic manipulation. However, their large parameter sizes and high inference latency hinder real-world deployment, especially on resource-constrained platforms. To address this, we conduct a systematic empirical study of model compression for VLAs. Building on these insights, we present \textit{RLRC}, a three-stage compression and recovery pipeline consisting of structured pruning, performance recovery via SFT and RL, and subsequent quantization. The RL stage incorporates a critic warm-up strategy and BC loss regularization to stabilize training and preserve policy behavior. RLRC achieves up to an 8 times memory reduction and 2.3 times inference speedup while maintaining the original task success rate. Extensive experiments across multiple VLA backbones show that RLRC consistently outperforms existing compression baselines, highlighting its effectiveness for on-device deployment. Project website: https://rlrc-vla.github.io
comment: 8 pages, 10 figures; accepted by RA-L 2026
TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation
Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines. Project page: https://torl-vla.github.io/
comment: Project page: https://torl-vla.github.io/
Cosmos 3: Omnimodal World Models for Physical AI
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.
Embedding Semantic Risk into Distance Fields and CBFs for Online Monocular Safe Control
We propose an online monocular perception-to-control framework that embeds semantic risk into the distance field used by Control Barrier Function (CBF)-based safe navigation and teleoperation. Many perception-based safety filters assign the same distance-based safety margin to all mapped obstacles or use semantics only as a downstream controller adjustment, rather than encoding semantic risk in the spatial representation. Our framework instead reasons online about obstacle geometry and class-dependent risk by embedding semantic information directly into the Euclidean Signed Distance Field (ESDF). This design encodes semantic risk before control optimization, so high-risk objects exert a larger spatial influence in the safety field while retaining efficient ESDF queries at runtime. Specifically, a foundation-model-based SLAM front end reconstructs dense 3-D geometry from monocular RGB video, while per-frame semantic segmentation provides pixel-level class labels that are fused into the reconstructed geometry. The resulting geometric-semantic representation is then converted into an ESDF, where semantic labels identify safety-relevant regions and impose class-dependent inflation before field computation. The semantic-aware ESDF provides the local distance values and spatial derivatives required by the CBF controller, while class-dependent gains further regulate the controller response. Extensive simulation and hardware experiments demonstrate online operation at 10--20 Hz and semantic-aware safe behavior in both teleoperation and autonomous navigation.
WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation
The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: $\textit{(i)}$ fidelity (i.e., producing simulated trajectories that correlate with reality), $\textit{(ii)}$ consistency (i.e., producing simulated trajectories that are coherent over long horizons), and $\textit{(iii)}$ efficiency (i.e., producing simulated trajectories quickly). We propose WEAVER (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. WEAVER is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply WEAVER in robotic hardware, demonstrating its effectiveness at policy evaluation ($ρ$=0.870 correlation with real-world success rate), policy improvement (real-world success rate improvement of $38\%$ on top of the $π_{0.5}$ robot foundation model), and test-time planning (real-world success rate improvement of $14\%$ with a $5-10\times$ speedup over prior WMs). WEAVER also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/ .
R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations ICRA 2026
Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.
comment: 8 pages, 6 figures. In Proceedings: IEEE International Conference on Robotics & Automation (ICRA 2026)
Bench-Push: Benchmarking Pushing-based Navigation and Manipulation Tasks for Mobile Robots
Mobile robots are increasingly deployed in cluttered environments with movable objects, posing challenges for traditional methods that prohibit interaction. In such settings, the mobile robot must go beyond traditional obstacle avoidance, leveraging pushing or nudging strategies to accomplish its goals. While research in pushing-based robotics is growing, evaluations rely on ad hoc setups, limiting reproducibility and cross-comparison. To address this, we present Bench-Push, the first unified benchmark for pushing-based mobile robot navigation and manipulation tasks. Bench-Push includes multiple components: 1) a comprehensive range of simulated environments that capture the fundamental challenges in pushing-based tasks, including navigating a maze with movable obstacles, autonomous ship navigation in ice-covered waters, box delivery, and area clearing, each with varying levels of complexity; 2) novel evaluation metrics to capture efficiency, interaction effort, and partial task completion; and 3) demonstrations using Bench-Push to evaluate example implementations of established baselines across environments. Bench-Push is open-sourced as a Python library with a modular design. The code, documentation, and trained models can be found at https://github.com/IvanIZ/BenchNPIN.
comment: Published in CRV 2026
Multiagent Systems
DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction
Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing works mainly focus on generating reports and summaries. In contrast, many enterprise tasks instead require an agent to identify concrete workflows which is a sequence of action-steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as: "How do I request new headcount given a fixed budget?". Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent to predict personalized workflow. We show that although DRFA improves over strong baseline agents (upto 10.02% average F1 score), there is substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.
On the Reliability of Networks of AI Agents: Density Evolution, Stopping Sets, and Architecture Optimization
Modern AI systems increasingly solve a task not with a single model call but with several imperfect agents working together: some propose pieces of a solution, others verify them, and the results are combined. These systems often outperform any single model, yet it is rarely clear why they succeed or when they will fail. We model such a system as message passing on a sparse graph, the structure that underlies low-density parity-check (LDPC) codes, and extend the density-evolution machinery of coding theory to this richer setting. In our model a task is a set of coupled binary subclaims, and an agent architecture is a sparse, role-typed factor graph whose check nodes are noisy Boolean verifier nodes, each computing a local Boolean function of the subclaims it touches. Three distinct failure modes, all modeled as erasures (an agent abstaining, a verifier returning no usable output, and a message lost between two agents), propagate as the agents exchange set-valued messages. The check agents combine these messages by a single logical-forcing rule that specializes to XOR, AND, OR, implication, and Horn constraints. This is more than a relabeling of LDPC theory: the verifier functions are nonlinear and value-asymmetric, and the three failure modes do not reduce to a single effective channel, so they require new threshold, finite-length, and converse results rather than a direct reuse of parity-check density evolution. We prove a density-evolution theorem that predicts the asymptotic fraction of unresolved subclaims on random role-typed architectures, with an extension to deterministic, locally tree-like graph sequences. The XOR case recovers the classical LDPC recursion on the binary erasure channel (BEC); the AND case exposes an asymmetry between positive and negative verifier certificates.
Intelligence Entropy Principle and the ADE Stability Engineering Framework
As LLM-driven multi-agent systems (MAS) transition from lab to production, system behavior exhibits nonlinear degradation. We introduce the Intelligence Entropy Principle: probability-driven systems spontaneously drift toward disorder, formalized as S(t) = S0 * exp(alpha*t/Cm), where Cm is a model capability coefficient we propose. Lyapunov analysis yields the stabilization condition lambda > alpha/Cm. We construct the ADE (Agent Delivery Engineering) four-layer framework (L1 Physical Laws through L4 User Adaptation) with 23 core components. Validation spans 100K-scale experiments and 33.6 days of production monitoring. We propose a Five-Layer Disorder Taxonomy unifying failures under structural collapse, and present Elastic Organization as an original MAS morphology. Results: channel fracture reduced from 69-98% to near 0%; system death probability below 0.02%.
comment: 32 pages, 18 figures
ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents
Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence, missing a provenance-sensitive failure mode: a claim may be supported somewhere while being attributed to the wrong source. We call this cross-source conflation. We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired with retrieval-augmented answer revision and re-verified. We evaluate on 281 medical-domain MCP-agent traces. A 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels split by trace; 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846, while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult with semantically close sources. Repair-and-reverify resolves all blocked answers in the full trace set, often via conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps with no retained wrong attribution. These results show that source attribution is an independent axis for factuality verification in MCP-based agents.
comment: 20 pages, 4 figures
LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI ICML 2026
AI systems deployed in legal workflows hallucinate at rates that aggregate metrics report at ~52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally-motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single deployment-comparable scalar; and a typed debate pipeline calibrated to both magnitudes and directions. Across 510 contracts and 249,252 clause-level instances we measure a within-model gap of approximately 38-40 pp between obligation/numeric and temporal claims that aggregate reporting hides, and show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45% with per-category gains tracking the diagnosis, matching commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show these diagnostics serve as calibration inputs for multi-agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically-tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.
comment: 15 pages, 5 figures; Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026
A Neuro-Symbolic Approach to Strategy Synthesis for Strategic Logics
Reasoning about what agents can achieve through strategic interaction is a core challenge in Multi-Agent Systems (MAS). Logics for strategic ability, such as ATL, provide rigorous methods, but their adoption is often hindered by the computational cost of strategy synthesis. We introduce a neuro-symbolic framework that integrates large language models (LLMs) into the model-checking pipeline for MAS. The LLM acts as a strategy-generation oracle, proposing candidate strategies that are then formally validated by a standard MAS model checker. This generate-and-certify architecture uses LLM guidance to navigate large combinatorial strategy spaces while preserving formal soundness: generated strategies are accepted only when certified by the verifier. We instantiate the framework for bounded strategic reasoning in NatATL and introduce the first NatATL strategy-synthesis dataset, consisting of 4211 instances. Experiments with an open-weight Qwen3-32B model show that our certified pipeline achieves 92\% accuracy on strategy-synthesis outcomes.
Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization
Big-Data-as-a-Service (BDaaS) platforms require re liable automation across data ingestion, cleaning, feature engi neering, model development, deployment, and post-deployment monitoring. However, existing LLM-based data science agents and AutoML systems mainly focus on isolated workflow stages, leaving limited support for lifecycle-level orchestration, artifact governance, human oversight, and drift-aware adaptation. This paper proposes a trustworthy self-composable BDaaS frame work based on LLM-orchestrated multi-agent collaboration. The proposed architecture decomposes the BDaaS lifecycle into specialized agents for data ingestion, data cleaning, feature engineering, AutoML training, model evaluation, MLOps de ployment, monitoring, and drift detection. A central LLM or chestration layer coordinates agent execution, validates interme diate outputs, manages workflow context, and enables dynamic workflow composition. The framework also incorporates shared artifact governance, reproducibility support, human-in-the-loop checkpoints, and drift-aware feedback loops. A prototype-based evaluation is conducted using controlled tabular benchmark datasets with missing values, categorical variables, outliers, class imbalance, and simulated covariate drift. Compared with manual ML, AutoML-only, and single-agent LLM baselines, the pro posed multi-agent BDaaS pipeline achieves competitive predictive performance while improving lifecycle-level reliability, including workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery. The results suggest that LLM-orchestrated multi-agent systems can extend conventional AutoML toward trustworthy, adaptive, and production-oriented BDaaS lifecycle automation.
comment: 7 pages, 3 figures, 5 tables
ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents
Robotics are expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides upon the robot's motion, while the robot senses the environment and decides where to execute the wildfire detection (onboard or remotely) and how. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. Especially in the most demanding missions, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.
comment: 14 pages, 9 figures
LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents
RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.
Empowering Economic Simulation Through Situation-Aware Llm-Driven Generative System ICASSP 2026
Traditional economic modeling typically follows a TOP-DOWN paradigm, neglecting individual diversity and the complexity of social interactions. To better capture the complexity of societal structure, Agent-Based Modeling (ABM) employs a BOTTOM-UP solution by incorporating micro-level dynamics to generate macroeconomic phenomena. Reinforcement Learning further improves its decision-making ability through tailored reward signals. However, existing ABM systems struggle to generalize beyond predefined scenarios. Recognizing the potential of LLM-driven role-playing in perception and human-like decision-making, we propose SAMAS, which models individual agents with rich macroeconomic understanding embedded in LLMs and economic trajectories experienced in the passing simulation steps. By jointly modeling both macro-level structural patterns and micro-level dynamic behaviors, SAMAS achieves superior performance in volatility realism and turning point prediction.
comment: ICASSP 2026
In-Context Environments Induce Evaluation-Awareness in Language Models
Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent \textit{evaluation awareness}. This raises concerns that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0\%. The intent -- execution gap reveals a monotonic resistance ordering: Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3\% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
Tacit Coordination of Large Language Models
Large Language Models (LLMs) are increasingly deployed in multi-agent settings that require coordination without communication, from human-AI interaction to safety-critical scenarios. Humans often overcome the absence of communication through focal points: salient solutions that naturally stand out to all participants. We present the first large-scale evaluation of how, when, and why focal points emerge in LLMs, comparing their behaviour with humans across cooperative and competitive games, including realistic search and rescue scenarios, demonstrating when focal points enable effective coordination. Across more than 20 open- and closed-source models, we find that LLMs exhibit a remarkable ability to coordinate without communication, often matching or outperforming humans. However, the same models consistently fail in tasks requiring numerical common sense or culturally nuanced notions of salience. We additionally evaluate simple learning-free strategies that substantially improve coordination both among LLMs and between humans and LLMs. Our results reveal striking coordination capabilities, as well as social limitations in modern LLMs, and offer new insight into the latent notions of salience encoded within them. Our findings caution against assuming that LLMs share humans' cultural and perceptual substrate when deployed in coordination settings.
comment: Code: https://github.com/EmanueleLM/focal-points
SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration
Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.
A Generalized Sinkhorn Algorithm for Mean-Field Schrödinger Bridge
The mean-field Schrödinger bridge (MFSB) problem concerns designing a minimum-effort controller that guides a diffusion process with nonlocal interaction to reach a given distribution from another by a fixed deadline. Unlike the standard Schrödinger bridge, the dynamical constraint for MFSB is the mean-field limit of a population of interacting agents with controls. It serves as a natural model for large-scale multi-agent systems. The MFSB is computationally challenging because the nonlocal interaction makes the problem nonconvex. We propose a generalization of the Hopf-Cole transform for MFSB and, building on it, design a Sinkhorn-type recursive algorithm to solve the associated system of integro-PDEs. Under mild assumptions on the interaction potential, we discuss convergence guarantees for the proposed algorithm. We present numerical examples with repulsive and attractive interactions to illustrate the theoretical contributions.
Information-Theoretic Measures in AI: A Practical Decision Guide
Information-theoretic (IT) measures are ubiquitous in artificial intelligence: entropy drives decision-tree splits and uncertainty quantification, cross-entropy is the default classification loss, mutual information underpins representation learning and feature selection, and transfer entropy reveals directed influence in dynamical systems. A second, less consolidated family of measures, integrated information (Phi), effective information (EI), and autonomy, has emerged for characterizing agent complexity. Despite wide adoption, measure selection is often decoupled from estimator assumptions, failure modes, and safe inferential claims. This paper provides a practical decision framework for all seven measures, organized around three prescriptive questions for each: (i) what question does the measure answer and in which AI context; (ii) which estimator is appropriate for the data type and dimensionality; and (iii) what is the most dangerous misuse. The framework is operationalized in two complementary artifacts: a measure-selection flowchart and a master decision table. We cover both AI/ML and decision-making agent application domains per measure, with standardized Bridge Boxes linking IT quantities to cognitive constructs. Three worked examples illustrate the framework on concrete practitioner scenarios spanning representation learning, temporal influence analysis, and evolved agent complexity.
comment: 25 pages, 2 tables, 1 figure. Submitted to Entropy (MDPI)
R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations ICRA 2026
Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.
comment: 8 pages, 6 figures. In Proceedings: IEEE International Conference on Robotics & Automation (ICRA 2026)
Self-Evolving Multi-Agent Systems via Textual Backpropagation
Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative team focused on a specific subtask. Our framework follows a two-phase optimization strategy: (1) Forward Phase - Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase - Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables our framework to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across seven benchmark datasets, our work surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements.
Systems and Control (EESS)
Learning Red Agent Policy from Observations for Neurosymbolic Autonomous Cyber Agents
With sophisticated cyber-attacks becoming increasingly prevalent, modern networks require intelligent autonomous cyber-defense agents trained via Reinforcement Learning (RL). These agents employ neurosymbolic approaches such as behavior trees with learning-enabled components (LECs) to learn, reason, adapt, and implement security rules while maintaining critical operations. However, these autonomous networks are partially observable systems, i.e., the cyber-attacker's (red agent's) actions are not observable, making it difficult for the defender to predict red actions, learn red policies, or assess the attacker's intrusion levels. To address this, we propose a Policy Learning Technique using imitation learning to learn policies for partially observable RL agents with discrete states and discrete actions. We apply this technique in an autonomous cyber environment to predict red agent's actions from network observations and defender actions. Integrated with a neurosymbolic cyber-defense agent, our method effectively handles different red policies and achieves high prediction accuracy across diverse simulated scenarios.
Finite-Time Queue Peak Laws in Stochastic Networks: Logarithmic Scaling After Geometric Thresholds
We study finite-horizon queue peaks in generalized switches, a standard stochastic-network model in which many queues share constrained service resources. Arrivals may be dependent, time-varying, and adapted to the past; the standing load condition is uniform interior slack, meaning the conditional mean arrival vector stays in a fixed contraction of the capacity region. We show that this slack reshapes the finite-time peak law for drift-minimizing scheduling policies such as MaxWeight. The square-root envelope that is sharp without slack persists only up to a geometry-dependent threshold; beyond that threshold, the running maximum grows only logarithmically with the horizon, both with high probability and in expectation. The mechanism is self-normalization: in the current queue direction, the projected fluctuation scale is normalized by the stabilizing drift scale. This removes capacity geometry from the logarithmic coefficient, while geometry remains in the threshold. Matching lower bounds show that both the logarithmic term and a geometric threshold are unavoidable. When finite-time state-space collapse is available, the threshold can be sharpened using local bottleneck geometry. For generalized input-queued switches, we obtain finite-time peak bounds with tight logarithmic coefficients. Simulations illustrate the two-phase envelope, local geometric refinements, and variance-sensitive improvements predicted by the theory.
Decentralized Decision-Making for Finite-State Systems over Finite Alphabets is Undecidable
This paper investigates decentralized decision-making for finite-state transition systems, i.e., discrete-event systems, under finite communication alphabet constraints. We consider a general decentralized observation framework in which a plant is observed by multiple local agents that transmit symbolic messages over a finite alphabet to a memoryless fusion center. The fusion center then produces a binary decision according to a prescribed fusion rule. We study the fundamental question of whether there exist local decision maps that enable exact reconstruction of a given regular specification language from decentralized observations. Contrary to classical results that rely on specific monotone fusion rules such as conjunction and disjunction, we show that the problem becomes undecidable even under a severely restricted information architecture: binary local decision alphabets and a fixed exclusive-or (XOR) fusion rule. The proof is based on a reduction from the Thue word problem, a classical undecidable problem in rewriting systems. We further show that decentralized supervisory control, decentralized fault diagnosis, and decentralized fault prognosis are also undecidable under finite communication alphabets. Our results reveal that existing decidability results fundamentally rely on structural properties of fusion rules, in particular their monotone order-preserving nature. In contrast, non-monotone fusion rules such as XOR break this structure, leading to undecidability even in highly restricted settings.
Verifiable computations for dynamic encrypted control
Encrypted control can preserve the privacy of data and parameters while the necessary computations can be outsourced to a cloud server. To ensure the integrity of the received values from the cloud, i.e., that they have not been changed, however, strong assumptions or verification algorithms are needed. Previous methods require computationally expensive cryptographic protocols or are only applicable to static computations. In this paper, we present a novel type of verification algorithm for linear dynamic encrypted control. We utilize system-theoretic input-output properties of the controller for artificial challenge signals, which are processed in the cloud in parallel with the requested control input, to check the correctness of the results at the plant. This results in almost no additional computational load, wrong computations are revealed with high probability, and no replay attacks are possible.
comment: Accepted for presentation at the 23rd IFAC World Congress 2026
Reducing Building Heat Demand Through Intelligent Control: A Comparative Simulation Study
Space heating remains the dominant energy consumer in buildings. While structural retrofitting can substantially reduce demand, it is often costly and time-intensive. As an alternative, this study investigates the potential of intelligent heating control strategies to reduce heat consumption with lower investment and faster implementation. Previous studies have shown that replacing conventional heating-curve-based controllers with model predictive controllers (MPCs) can reduce heating energy demand. Whereas most studies compare MPC to conventional control, this work evaluates two MPC strategies with different control objectives and quantifies their impact on indoor temperature tracking and heating demand. A virtual residential building model was developed in Python based on ISO 52016-1 to generate synthetic measurement data. A simplified resistance-capacitance (RC) model was parametrised using this dataset and used as the internal model for two MPC strategies implemented in MATLAB. The strategies differ only in their optimisation objective: one minimises quadratic heating power, while the other prioritises indoor temperature tracking for thermal comfort. Simulations over six days show that both strategies satisfy comfort and system constraints, but differ in energy use and temperature variation. The comfort-oriented controller achieves lower total heat consumption than the controller minimising heating power, which is attributed to the penalisation of high heating rates in the quadratic objective function. The results demonstrate the importance of objective function formulation in MPC design and show that high comfort levels can be maintained while achieving lower heating demand without structural modifications to the building envelope.
comment: 9 pages, 5 figures, 1 table. REHABEND 2026, 11th Euro-American Congress
Three-phase model of unbalanced distribution networks with DERs
Classical DistFlow equations for steady-state distribution network analysis fail to capture the inherent imbalances of three-phase systems arising from asymmetrical lines, loads, and distributed energy resources (DERs). This paper extends the classical power flow (PF) equations into a rigorous, non-approximated three-phase formulation, termed Dist3Flow. The proposed branch flow model (BFM) utilizes the real and imaginary components of nodal voltages and the active and reactive power flows as state variables. Lines are modelled by nonlinear forward and backward equations, while loads and DERs are represented via ZIP models and P-Q control, respectively. By incorporating specific boundary conditions at the terminal nodes, the formulation generalizes PF analysis to both radial and closed-ring topologies. The solution is obtained by using a backward/borward sweep (BFS) algorithm. The approach is validated against OpenDSS across various configurations, considering open-ring and closed-ring topologies with and without DERs.
Model-Free Control for Multi-Time Scale Dynamics of Grid-Connected Power Converters
Controller synthesis in power electronics-based systems depends predominantly on the mathematical model of the system, which is a limitation when the actual system is complex and the mathematical model cannot capture all its dynamics. Model-free control addresses this limitation by using an ad-hoc simple model which is compensated by high-rate evaluation of dynamics in terms of their derivatives. However, application of the model-free control strategy to power electronics-based multi-time scale dynamical systems is challenging because of the derivative action needed to implement such control. Grid-connected power converters are examples of such systems, yet experimental validation has not been adequately addressed in the literature. This letter presents the validation of such control including the hardware implementation level. An intelligent proportional-integral (iPI) controller is synthesized and validated on a 16 kW experimental test bench. This proves the benefits of the approach in control of grid-connected power converters, among which their participation in the secondary voltage control.
Information-Theoretic Meta Dynamic Programming for Signalling and Control of POMDPs
In this paper, we study the information-theoretic characterization of simultaneous signalling and control over channels modeled by partially observable Markov decision processes (POMDPs). The problem is formulated as an optimization over randomized control strategies that maximize the directed information from actions to observations, subject to an average-cost constraint. We derive a novel dynamic programming framework in which the state is defined on the space of conditional probability distributions, leading to a high-level ``meta'' dynamic program. Specifically, we show that two coupled information states, namely, the posterior distribution of the system state and a distribution over such posteriors, satisfy Markov recursions and provide sufficient statistics for optimal control. This structure enables the decomposition of optimal strategies into separated randomized policies that depend only on these information states. Our results establish necessary and sufficient conditions for optimality and unify classical stochastic control and information-theoretic formulations. In particular, we show that in the absence of signalling, the proposed framework reduces to the standard dynamic programming equations for POMDPs. The developed approach provides a principled foundation for analyzing and designing control systems with intrinsic information constraints.
comment: 8 pages, 1 Figure
A Wearable Multimodal Ultrasound+Inertial System for Real-Time Virtual Reality Interaction
A-mode ultrasound (US) is a promising sensing modality for Virtual Reality (VR) interaction, as it enables the mapping of muscular activity into control commands while retaining the benefits of wearable sensing. However, existing approaches still face limitations in terms of wearability and interaction complexity, often relying on external hardware such as cameras. In this work, we propose a fully wearable multimodal interface for real-time VR-interaction, based on concurrent US and inertial (accelerometry) sensing from the forearm and upper arm. The system is built on the WULPUS platform and integrates an end-to-end software framework for real-time acquisition, visualization, and communication with a Unity-based VR environment. A multimodal learning pipeline is introduced for concurrent hand pose and forearm position estimation in 2D space. The interface is evaluated through offline and online experiments with five subjects, during the execution of three functional tasks: cylinder grasping (gross motor) and relocation, marble pinching (fine motor) and relocation, and liquid pouring. For offline experiments, we collect 5 acquisition sessions across multiple days, achieving an average inter-session accuracy across subjects of 80$\pm$6\% for hand pose estimation and 77$\pm$7\% for forearm position estimation. Online validation with minimal fine-tuning (5 min) demonstrates success rates of 92.0$\pm$16.0\%, 88.0$\pm$9.8\%, and 96.0$\pm$8.0\% for the three tasks, respectively. With a power consumption of only 19.9~mW, our system enables more than 2.5 days of continuous use on a small 350 mAh LiPo battery without the need for recharge, enabling truly wearable, multimodal, and functionally meaningful VR interaction.
comment: 8 pages, 8 figures, 3 tables
Low-Thrust Orbital Differential Games with Speed Constraint Enforcement Using CostWeighting
This paper considers the problem of a low-thrust spacecraft pursuit-evasion differential game with an arbitrary terminal relative speed constraint. It addresses the terminal phase of the engagement for two relatively close spacecraft near a circular orbit. The problem is formulated as a linear-quadratic zero-sum differential game, with soft constraints on the terminal relative position and velocity, and running costs on the players' control efforts. An analytical, closed-loop, minimum-fuel-consumption optimal guidance law is derived for each player, forming a saddle-point solution. It is proven that any terminal speed can be achieved by properly choosing the weighting parameters of the cost function. To verify the optimality of the solution, a conjugate point analysis is performed when the cost function velocity weighting matrix is either positive or negative definite. The negative-definite case arises at high terminal speeds and is seldom seen in the literature. The performance of the derived guidance law is evaluated in simulations for different target maneuvers and compared to a state-of-the-art optimal-control-based guidance law. The simulations show that the derived guidance law satisfies the constraints and offers a substantial advantage over the optimal-control-based guidance law when the target is optimally evading.
comment: This work was submitted for journal publication. 22 pages and 9 figures
Dynamic Analysis of Centralized Energy Storage Systems -- A Comparison between Grid-following and Grid-forming Controls
This study investigates the small-signal stability of centralized energy storage systems (CESSs) using grid-following (GFL) and grid-forming (GFM) controls, particularly focusing on bidirectional power flow and multiple energy storage systems (ESSs). To address the issue of complex dynamics in CESSs when comprehensive GFL and GFM control loops are considered, high-order dynamics are simplified using the virtual damping method by focusing on the dominant oscillation mode. Damping analysis verifies that CESSs using a single-type control (either GFL or GFM) have dynamic superimposition characteristics. Specifically, as ESS number increases, the damping of GFM-CESSs improves but that of GFL-CESSs decreases. The damping sensitivity shows that the damping of GFM-CESSs is more sensitive to bidirectional power flow and all control loops, whereas that of GFL-CESSs is more sensitive to d-axis control loop. Consequently, GFM-CESSs are preferred for large-scale integration but are limited in scenarios with significant power reversal. If GFL and GFM controls are hybridized in CESSs, the ratio of GFM-CESSs should be constrained to avoid instability from modal resonance between GFL-CESSs and GFM-CESSs. This highlights that implementing GFM-CESSs necessitates considering scenario limitations rather than pursuing maximal integration under hybrid integration conditions. The conclusions are validated through modal analysis and time-domain simulations.
comment: This paper has been accepted for publication in IEEE TRANS POWER SYSTEMS, 2026. The final version of record will be available via IEEE Xplore
When Dynamics Models Read the Wrong Time Steps: Label-Free Event Credit Re-Anchoring for Robust Global Readouts
Learned dynamics models often answer global physical questions, such as fault severity or impact stiffness, by pooling a per-step feature sequence into one readout vector. This sequence-to-global interface creates an under-studied temporal credit problem: with only trajectory-level supervision, a model can predict accurately in training conditions while reading from abundant smooth correlates rather than the brief physical events that determine the target. We call this failure temporal credit dilution. It is not exposed by the training loss and is not removed by standard physics-informed residuals, because the error lies in where the global readout assigns functional credit. We introduce Credit-in-Event, an interface-level probe for measuring how much pooled credit lands on event steps, and prove in closed form that a pooled linear reader routes credit to a spurious background channel as the event fraction shrinks. We then propose CREST, a training-free and label-free readout that estimates a transient event core from learned features and re-anchors the pooled representation through event-versus-rest contrast. Across simulated gear and impact systems, recurrent and attention encoders, and public bearing vibration data, CREST reduces out-of-distribution error while restoring event credit. Ablations show that stable-step selection and receptive-field shrinking fail, confirming that the gain comes from event-core credit re-anchoring rather than a generic locality or stability prior.
comment: 7 pages, 6 figures
Instability Caused by Integration of IBRs under Strong Grid Connections -- A Practical Case Study on Large-scale Energy Storage Systems
It has been well known that inverter-based resources (IBRs) can lead to converter-driven stability issues under weak grid connections. However, as the number of IBRs increases, instabilities can also occur even under strong grid connections. A practical case is presented to demonstrate this conclusion, using large-scale energy storage systems (ESSs) as an example. In this study, the ESSs induce oscillations with a frequency of 150 Hz in the d-q coordinates while providing both capacitive and inductive reactive power support (achieved by ESS functional control loops) to the connected power system. Theoretical analysis reveals that under strong grid connections, the dynamic interactions among power conversion systems (PCSs) of ESSs can be superimposed and intensified as the ESS scale extends, which reduces oscillation damping and leads to system instability. This indicates that ESS functional control loops also have potential instability risks when providing supports to power systems, which should be carefully examined. Finally, major impact factors are identified to mitigate the oscillations, and the conclusions are validated based on the SIMULINK platform. This paper provides valuable practical insights into system instabilities even under strong grid conditions, emphasizing the importance of functional control design and careful planning of the scale for IBR-dominated systems.
comment: This paper has been accepted for publication in IEEE TRANS POWER SYSTEMS, 2026. The final version of record will be available via IEEE Xplore
Stability Analysis in Large-scale Centralized Bidirectional Inverter-based Stations Connected to Bulk Power Systems through AC and DC Connections
Massive controlled DC resources (CDCRs), such as battery energy storage systems, are connected to AC power systems through bidirectional inverters for power balance requirements. This study investigates converter-driven stability (CDS) issues in the sub-synchronous frequency range caused by large-scale bidirectional inverter-based stations (IBSs). The impacts of the AC and DC connections of IBSs on subsynchronous oscillations (SSOs) are compared by examining three factors: the number of CDCRs, power flow direction, and control parameters of the inverters. For AC connections, IBSs may induce instability as the number of CDCRs increases, regardless of the power flow direction. To maintain stability, the maximum power amplitude of the IBS is calculated. It is found that switching to DC connections can reduce these instability risks if the DC line resistance is much less than the AC line reactance. Moreover, the method of tuning control parameters is demonstrated to be more effective in improving power-related critical stability under DC connections. Therefore, The DC-IBS is preferred for high-voltage transmission. Finally, the conclusions are validated in power systems connected with both AC- and DC-IBSs under various network topologies and system scales.
comment: Accepted for publication in IEEE TRANS POWER SYSTEMS, 2026
Anywhere, Any-Stymie: Remote Activation of Trojan Malware on LiDAR with Modulated Signals
LiDAR sensors are widely deployed in autonomous systems for 3D perception and safety-critical decision-making. We identify a previously unexplored attack surface in which dormant malware embedded in the LiDAR sensing pipeline remains inactive during normal operation and can be externally triggered after deployment, without requiring access to sensor hardware or networking at attack time. To operationalize this threat, we design malware capable of low-level point-cloud manipulation and embed it into LiDAR firmware. This malware was developed in a closed research test environment with vendor technical support, rather than by exploiting an inherent production supply-chain vulnerability. To selectively trigger attack activation, we design and implement an optical trigger that remotely activates the malware by delivering a modulated signal into the sensing environment. Once triggered, the malware performs real-time point cloud manipulation, and we demonstrate false object injection and real object suppression on static and mobile victim platforms. Our evaluation first establishes attack feasibility, including static operation at 300~ft and recorded drive-by runs reaching 35~mph. We then illustrate quantitatively that injected person-like artifacts can remain semantically detectable by a state-of-the-art 3D object detector. Finally, we demonstrate multiple modes of safety-critical impact on a deployed tactical autonomous vehicle. Together, these results highlight the need for stronger integrity guarantees throughout the LiDAR sensor development and deployment pipeline.
OmniDroneX: An LLM-Assisted Holistic Drone-as-a-Service Ecosystem
Despite rapid advances in UAV technologies, current deployments remain limited due to several gaps in UAV systems research. To address these challenges, we propose OmniDroneX, a unified Drone-as-a-Service ecosystem, in which drones are transitioned from fixed function platforms into dynamically composable entities that can be integrated with external infrastructures to offer omni-capabilities. OmniDroneX bridges low-level physical primitives with high-level mission intent through a unified vendor-agnostic interface (libUAV) and a formal physical-service abstraction model (PT-SOA). A core innovation is the diverse application of large language models (LLMs) across multiple layers of the OmniDroneX architecture. LLMs are used to assist in identifying and formalizing primitive device functions and abstract service definitions, supporting automated service composition and workflow generation, and enabling interactive, natural-language mission specification and refinement. OmniDroneX also incorporates important categories of composition techniques that are essential in dynamic UAV systems, including physical layer composition for drone capability augmentation, as well as spatiotemporal, functional, collaborative, exception-aware, and QoS-based service compositions. Collectively, these features allow OmniDroneX to serve as a foundation for scalable, resilient, and self-evolving UAV ecosystems operating in complex and dynamic environments.
comment: Ongoing project paper that is going to be submitted to JointCloud Computing
Data-Driven Stabilizing Controller Design for Linear Infinite Networks
We propose a direct data-driven method for controller synthesis of infinite networks composed of unknown linear time-invariant subsystems. Using a single set of noise-corrupted input-state trajectories collected from each subsystem, and provided that certain linear matrix inequalities hold, each subsystem is rendered exponentially input-to-state stable (eISS) by locally constructing an eISS control Lyapunov function together with an exponentially input-to-state stabilizing feedback controller. We then compose these local components under a compositional small-gain condition in infinite-dimensional spaces to obtain a global control Lyapunov function and an associated stabilizing controller, ensuring uniform global exponential stability of the infinite network. The approach is validated on a physical case study with unknown dynamics.
comment: This paper has been accepted at the 27th International Symposium on Mathematical Theory of Networks and Systems (MTNS)
ReRAM-aware Model Finetuning addressing I-V Non-linearity and Retention Errors
Traditional CPU, GPU, and NPU architectures are increasingly limited by the von Neumann bottleneck. While In-Memory Computing (IMC) using ReRAM crossbar arrays offers a high-density, energy-efficient alternative, its practical deployment is constrained through their non-idealities. Existing hardware-aware training frameworks often require training from scratch, which is computationally prohibitive for modern large-scale models. In this work, we propose a finetuning-based hardware-aware training algorithm that enables robust DNN deployment on ReRAM with minimal training overhead. Our approach mitigates I-V non-linearity by applying a range-shrunk sinh transformation and incorporates retention errors directly into a regularization loss during the finetuning process. We evaluate our framework across models and tasks such as image classification and question-answering (QA). Experimental results demonstrate that our method achieves similar accuracy on large-scale models like ResNet18 and DeiT-Tiny as the base model. In-case of ImageNet for MobileNetV3 families the technique has only less than 2% accuracy degradation. Further, applying the technique on the SQuAD v2 dataset results in only 1 point degradation of F-1 score.
comment: 11 pages, 12 figures, 2 tables, with appendix (5 pages, 9 figures)
Perron--Frobenius Operator Matching for Generative Modeling
We introduce Perron--Frobenius Operator Matching (PFOM), a generative framework that matches density evolution via the integral PF operator, subsuming flow, diffusion, and jump models. We prove that among Bregman divergences, only Kullback--Leibler divergence preserves equality between density-level and sample-conditioned objectives, yielding a practical loss equivalent to Koopman path matching. We further develop Nesterov-accelerated training and sampling that stabilize discretization and accelerate convergence. %On Gaussian mixtures and two-moons, PFOM achieves faster KL/$W_2$/MMD decrease and improved wall-clock efficiency with empirical validation. PFOM unifies operator-theoretic identification with modern generative modeling and opens paths to adaptive dictionaries and high-dimensional applications.
Agent Utilities over Generalized Voronoi Regions and their Gradients
In this paper, we generalize the concept of Voronoi regions, define agent utility as the integral of a utility density over the corresponding Voronoi region, derive gradients of the utility, and illustrate the approach in a two-team example from soccer. The generalization of Voronoi regions is in the form of so-called Cost-Induced Voronoi (CIV) regions, where the agent state space may differ from the space being partitioned. One example of such regions is when the cost is given by the optimal solution of an LQR control problem. Then the agent states include position as well as velocity, while the partitioned space only includes positions. The agent utility is defined by integrating some utility density over the CIV region of the agent. This utility density might be the probability density of some beneficial event, such as receiving a pass in soccer. The utility is then the overall probability of receiving a pass and the gradient represents a way to improve that probability. We show how this utility gradient can be computed using the Reynolds Transport Theorem from fluid mechanics, and that this approach achieves similar accuracy while reducing computation time by about an order of magnitude compared to a baseline finite-difference approximation.
comment: Under review at IEEE Control Systems Letters (L-CSS)
Performance-Driven Environment Abstraction with Multi-Timescale Learning
We study performance-driven environment abstraction for decision-making in large Markov decision processes. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality. We model abstraction as a controlled approximation obtained by aggregating the state space and enforcing a shared action distribution within each aggregated state. For a fixed partition, we establish a performance guarantee that separates value-function approximation error from the loss introduced by action sharing. Guided by this analysis, we develop a multi-timescale reinforcement learning framework that jointly adapts the policy and a tree-structured environment abstraction. The resulting algorithm refines and coarsens regions of the state space based on Q-value discrepancies, balancing performance against abstraction size and complexity. Empirical results demonstrate substantial state compression, improved sample efficiency, and faster replanning compared to actor-critic baselines.
Stability of Slow-Fast Nonlinear Dynamics: Non-Periodic Case
We present sufficient conditions for the semi-global exponential stability of nonlinear systems whose dynamics have both slow and fast time variations. Unlike most existing results, the fast variation is non-periodic, thereby allowing a wider class of systems, especially switched systems with fast (non-periodic) switching and those with quasi-periodic variations; we therefore rely on general averaging to construct an average system. It is assumed that the average system admits a time-invariant equilibrium that is globally exponentially stable when the slow variation is frozen, i.e., remaining at a fixed value. This slow variation is allowed to be discontinuous in time, provided its total variation (flows and jumps) is bounded. The main result is illustrated using a nonlinear switched system with slow-fast non-periodic switching.
comment: 27th International Symposium on Mathematical Theory of Networks and Systems (MTNS), Waterloo, ON, Canada, Aug. 2026
Constellation-Level Power Allocation for LEO Space-Based Solar Power
Space-based solar power (SBSP) has recently gained renewed attention as an appealing technological advancement for providing continuous clean energy using space-based infrastructure. However, the potential of low-Earth orbit (LEO) satellite constellations for SBSP remains largely unexplored and lacks detailed simulation-based studies. In this paper, we introduce a novel LEO SBSP system model and conduct a 24-hour system-level simulation of a Walker $4\times 5$ LEO SBSP constellation at an altitude of 450\,km, beaming 2.45\,GHz microwave power to eight ground stations (GSs) under a greedy allocation policy. The model includes orbital propagation, eclipse cycles, the satellite power chain, Goubau--Brown beam coupling, ITU-R P.618 atmospheric attenuation, and onboard battery dynamics. The results confirm that the peak DC power delivered reaches 1.986\,MW, while the mean per-site delivery at the served GS ranged from 40 to 75\,kW. Two of the eight GSs received no service during the run, as their passes were consistently ranked lower under the greedy policy than competing links at the same step. The incident peak power density (PD) at the rectenna remained within the 3.35--5.72\,W/m\textsuperscript{2} range, below the International Commission on Non-Ionizing Radiation Protection (ICNIRP) general-public exposure limit. For a 20-satellite Walker LEO at this altitude, realistic per-site delivery is 50--100 kW, and the rectenna should be sized to the operational incident PD of order 5,W/m\textsuperscript{2} rather than to a Geostationary Earth Orbit (GEO)-era 100,W/m\textsuperscript{2} rating.
Deep-Learning-Based Pixelated Microwave Filter Design and Characterization using Electro-Optical Electric-Field Measurements
Traditional microwave filter design typically relies on iterative parameter tuning and predefined topologies, which limits design space and increases development time. This study uses a deep learning approach combining convolutional neural networks with genetic algorithms to automate pixelated microwave filter synthesis. To validate the approach experimentally, both S-parameter and spatial electric-field measurements were analyzed. The synthesized low-pass filter demonstrated excellent agreement between simulated and measured performance, achieving a 7 GHz passband with over 20 dB suppression beyond 9.5 GHz. Electro-optical measurements, for the first time, revealed electric field patterns that resemble coupled transmission-lines or stub structures, providing insight into the emergent characteristics of AI-generated designs.
Deep Learning-Driven Inverse Design of Doherty Power Amplifiers Using Pixelated Combiners and Dual-State Impedance Synthesis
The output combiner of a Doherty power amplifier (PA) integrates load modulation, impedance matching, and phase compensation within a single network, making its design and synthesis highly challenging. In this paper, we propose a three-port Doherty combiner design methodology that combines deep convolutional neural networks (CNNs), pixelated layout representations, and genetic algorithms (GA) with dual-state impedance synthesis to address both peak and back-off power conditions. As a proof of concept, two GaN HEMT Doherty PA prototypes incorporating three-port pixelated combiners are designed and fabricated. Both prototypes achieve a measured saturated output power exceeding 44.2 dBm with peak drain efficiency above 71.2% within 2.6-2.8 GHz. Furthermore, a drain efficiency as high as 64% is measured at the 6-dB back-off level. After applying digital predistortion, each prototype achieves an adjacent channel leakage ratio (ACLR) better than -51.3 dBc.
Learning-Based Decision Making for Combustion Phasing Control in Multi-Fuel CI Engines with Latent Fuel Reactivity Estimation
Multi-fuel compression-ignition engines offer fuel flexibility but introduce uncertain, time-varying fuel reactivity, represented by cetane number (CN), which complicates cycle-to-cycle combustion-phasing control. This work formulates CA50 regulation under latent CN variation as a partially observable sequential decision problem and systematically evaluates controllers with increasing temporal and representational capacity, including LinUCB, history-augmented contextual bandits, observation-only DDPG, recurrent DDPG, and a proposed GRU-guided RL framework. A Gaussian-process surrogate trained on experimental multi-fuel engine data provides a controlled and reproducible evaluation environment. Results show that myopic and fixed-history bandit methods degrade under CN variation, observation-only RL suffers from latent-state aliasing, and generic recurrence is insufficient when CN evolves rapidly. The proposed framework learns a compact GRU-based representation of fuel reactivity from combustion history and conditions both actor and critic on this estimated signal rather than oracle CN. By training the policy on the same imperfect fuel-reactivity information available at deployment, the controller avoids train-deploy inconsistency in conventional online estimate-then-control pipelines. Across unseen CN trajectories, the policy achieves stable CA50 regulation with mean absolute tracking error below 0.25° CA at the training setpoint, while producing smooth, physically consistent SOI and glow-plug-power actuation. These results show that combustion control under latent, continuously evolving fuel dynamics requires more than standalone estimation or generic recurrence. By aligning fuel-reactivity inference with control policy learning, the proposed framework enables reactivity-aware decision-making using the same estimated state available during deployment.
Dynamic Resource Allocation with Karma: An Experimental Study
We perform a behavioral experiment of karma, a class of mechanisms for repeated resource allocation with attractive fairness and efficiency properties, in theory. Individuals in these mechanisms bid non-tradable credits that flow from resource consumers to yielders, like karma. Human subjects recruited on Amazon MTurk are repeatedly and randomly paired to bid karma according to time-varying and stochastic individual preferences or urgency to acquire resources. Treatments varied in the dynamic urgency process (frequent moderate urgency versus sporadic high urgency) and the richness of the bidding scheme (binary versus full range). Results are benchmarked against random allocation, and karma achieves a (almost) Pareto improvement over random, despite the MTurk subjects deviating significantly from the theoretically optimal Nash bidding policy. Maximum improvement is attained by subjects that deviate from Nash by up to one karma bid unit on average, and positive improvement is attained with average deviations of up to 3-4 bid units. These findings hold across all treatments, among which no significant differences are found, with the exception of the sporadic high urgency process with binary bidding treatment being (weakly) favorable over others. These results offer behaviorally robust lower bounds for the expected performance of karma in human populations. They also provide guidance for future testing and implementation of karma mechanisms in the real world.
Automating the Wildfire Detection and Scheduling Pipeline with Maneuverable Earth Observation Satellites
Wildfires are becoming increasingly frequent, with potentially devastating consequences, including loss of life, infrastructure destruction, and severe environmental damage. Low-Earth-orbit satellites equipped with onboard sensors can capture critical information relative to active wildfires and enable near-real-time detection through machine learning algorithms applied to the acquired data. We propose a framework that automates the complete wildfire detection and satellite scheduling pipeline, entitled the WildFire-applicable Intelligent and Responsive Ensemble for Detection and Scheduling (WildFIRE-DS). This paper develops an algorithm to realize the vision of the WildFIRE-DS as a proof of concept, integrating three key components: wildfire detection in satellite imagery, statistical updating that incorporates data from repeated flyovers, and multisatellite scheduling optimization. The algorithm enables wildfire detection using convolutional neural networks with sensor fusion techniques, incorporates subsequent flyover information via Bayesian statistics, and schedules a constellation of satellites using the state-of-the-art Reconfigurable Earth Observation Satellite Scheduling Problem. Simulated experiments conducted using real-world wildfire locations and the orbits of operational Earth observation satellites to demonstrate that this autonomous detection and scheduling approach effectively enhances wildfire monitoring capabilities.
comment: 46 pages, Journal of Aerospace Information Systems (Accepted)
Adaptive Modular Geometric Control of Robotic Manipulators
This paper develops an adaptive modular geometric control framework for robotic manipulators with uncertain inertial parameters. The manipulator is decomposed into rigid-body and joint modules, where each rigid-body module is represented by an Euler-Poincaré-type spatial dynamics on the Lie algebra se(3), and configuration errors are defined intrinsically through logarithmic maps on SE(3). The joint modules impose local screw constraints that relate adjacent body twists, accelerations, and transmitted wrenches, yielding a recursive propagation structure for the interconnected multibody system. Within this formulation, local geometric control laws are constructed at the module level, while the interconnection among modules is characterized by power-conjugate twist--wrench pairs induced by the natural duality pairing between the Lie algebra se(3) and its dual space se(3)^*. For the nominal case, exponential tracking stability of the interconnected system is established using local configuration energy functions on SE(3) and the power-preserving structure of the modular interconnection. To address inertial parametric uncertainty, a geometric adaptation law is introduced on the manifold of symmetric positive-definite matrices, ensuring physically consistent parameter estimates while retaining compatibility with the Lie-algebraic control formulation. Under the adaptive controller, semi-global uniform ultimate boundedness of the closed-loop tracking and parameter estimation errors is proven. Numerical simulations on a redundant high-inertia robotic manipulator demonstrate accurate pose tracking, smooth transient behavior, orientation regulation, and robustness under inertial perturbations. Comparative studies with state-of-the-art methods further illustrate the effectiveness of the proposed framework for complex robotic manipulation tasks.
Generative AI Enabled Robust Sensor Placement in Cyber-Physical Power Systems: A Graph Diffusion Approach
With advancements in physical power systems and network technologies, integrated Cyber-Physical Power Systems (CPPS) have significantly enhanced system monitoring and control efficiency and reliability. This integration, however, introduces complex challenges in designing coherent CPPS, particularly as few studies concurrently address the deployment of physical layers and communication connections in the cyber layer. This paper addresses these challenges by proposing a framework for robust sensor placement to optimize anomaly detection in the physical layer and enhance communication resilience in the cyber layer. We model the CPPS as an interdependent network via a graph, allowing for simultaneous consideration of both layers. Then, we adopt the Log-normal Shadowing Path Loss (LNSPL) model to ensure reliable data transmission. Additionally, we leverage the Fiedler value to measure graph resilience against line failures and three anomaly detectors to fortify system safety. However, the optimization problem is NP-hard. Therefore, we introduce the Experience Feedback Graph Diffusion (EFGD) algorithm, which utilizes a diffusion process to generate optimal sensor placement strategies. This algorithm incorporates cross-entropy gradient and experience feedback mechanisms to expedite convergence and generate higher reward strategies. Extensive simulations demonstrate that the EFGD algorithm enhances model convergence by 18.9% over existing graph diffusion methods and improves average reward by 22.90% compared to Denoising Diffusion Policy Optimization (DDPO) and 19.57% compared to Graph Diffusion Policy Optimization (GDPO), thereby significantly bolstering the robustness and reliability of CPPS operations.
From the Linear Quadratic Regulator (LQR) to the (Deterministic) Kalman Filter in Two Easy Steps
This note is a tutorial on the deterministic version of the Kalman filter (state estimator), which is formulated as finding the state trajectory consistent with the system's equations with the minimal amount of $L^2$ process and measurement uncertainty. As stated, this is an input signal design problem with linear dynamics and an objective that is affine-quadratic in the state and inputs. The first step is to convert this problem to one with a purely quadratic objective by embedding in a larger system using ``homogeneous coordinates''. This converts the problem to a purely quadratic (i.e. an LQR) problem, but with non-standard initial or final state constraints. This latter problem can then be solved using a version of the matrix Differential Riccati Equation (DRE) for the larger LQR problem. The second step is a partitioning of this larger problem, which then yields the optimal dynamic observer and the DRE of the traditional Kalman filter. For comparison, the solution of the traditional LQ-tracking (Servomechanism) problem is also treated using a similar construction.
A Generalized Sinkhorn Algorithm for Mean-Field Schrödinger Bridge
The mean-field Schrödinger bridge (MFSB) problem concerns designing a minimum-effort controller that guides a diffusion process with nonlocal interaction to reach a given distribution from another by a fixed deadline. Unlike the standard Schrödinger bridge, the dynamical constraint for MFSB is the mean-field limit of a population of interacting agents with controls. It serves as a natural model for large-scale multi-agent systems. The MFSB is computationally challenging because the nonlocal interaction makes the problem nonconvex. We propose a generalization of the Hopf-Cole transform for MFSB and, building on it, design a Sinkhorn-type recursive algorithm to solve the associated system of integro-PDEs. Under mild assumptions on the interaction potential, we discuss convergence guarantees for the proposed algorithm. We present numerical examples with repulsive and attractive interactions to illustrate the theoretical contributions.
Choose Wisely: Data-driven Predictive Control for Nonlinear Systems Using Online Data Selection
This paper proposes Select-Data-driven Predictive Control (Select-DPC), a new method for controlling nonlinear systems using output-feedback for which data are available but an explicit model is not. At each timestep, Select-DPC employs only the most relevant data to implicitly linearize the dynamics in "trajectory space". Then, taking user-defined output constraints into account, it makes control decisions using a convex optimization. This optimal control is applied in a receding-horizon manner. As the online data-selection is the core of Select-DPC, we propose and verify both norm-based and manifold-embedding-based selection methods. We evaluate Select-DPC on three benchmark nonlinear system simulators -- rocket-landing, a robotic arm and cart-pole inverted pendulum swing-up -- comparing them with standard Data-enabled Predictive Control (DeePC) and Time-Windowed DeePC methods, and find that Select-DPC outperforms both methods.
Ultrafast On-chip Online Learning via Spline Locality in Kolmogorov-Arnold Networks ICML'26
Ultrafast online learning is essential for high-frequency systems, such as controls for quantum computing and nuclear fusion, where adaptation must occur on sub-microsecond timescales. Meeting these requirements demands low-latency, fixed-precision computation under strict memory constraints, a regime in which conventional Multi-Layer Perceptrons (MLPs) are both inefficient and numerically unstable. We identify key properties of Kolmogorov-Arnold Networks (KANs) that align with these constraints. Specifically, we show that: (i) KAN updates exploiting B-spline locality are sparse, enabling superior on-chip resource scaling, and (ii) KANs are inherently robust to fixed-point quantization. By implementing fixed-point online training on Field-Programmable Gate Arrays (FPGAs), a representative platform for on-chip computation, we demonstrate that KAN-based online learners are significantly more efficient and expressive than MLPs across a range of low-latency and resource-constrained tasks. To our knowledge, this work is the first to demonstrate model-free online learning at sub-microsecond latencies.
comment: Forty-Third International Conference on Machine Learning (ICML'26)
Data-informativity conditions for structured linear systems with implications for dynamic networks
When estimating a single subsystem (module) in a linear dynamic network with a prediction error method, a data-informativity condition needs to be satisfied for arriving at a consistent module estimate. This concerns a condition on input signals in the constructed, possibly MIMO (multiple input multiple output) predictor model being persistently exciting, which is typically guaranteed if the input spectrum is positive definite for a sufficient number of frequencies. Generically, the condition can be formulated as a path-based condition on the graph of the network model. The current condition has two elements of possible conservatism: (a) rather than focussing on the full MIMO model, one would like to be able to focus on consistently estimating the target module only, and (b) structural information, such as structural zero elements in the interconnection structure or known subsystems, should be taken into account. In this paper relaxed conditions for data-informativity are derived addressing these two issues, leading to relaxed path-based conditions on the network graph. This leads to experimental conditions that are less strict, i.e. require a smaller number of external excitation signals. Additionally, the new expressions for data-informativity in identification are shown to be closely related to earlier derived conditions for (generic) single module identifiability.
comment: 17 pages, 5 figures
KANELÉ: Kolmogorov-Arnold Networks for Efficient LUT-based Evaluation
Low-latency, resource-efficient neural network inference on FPGAs is essential for applications demanding real-time capability and low power. Lookup table (LUT)-based neural networks are a common solution, combining strong representational power with efficient FPGA implementation. In this work, we introduce KANELÉ, a framework that exploits the unique properties of Kolmogorov-Arnold Networks (KANs) for FPGA deployment. Unlike traditional multilayer perceptrons (MLPs), KANs employ learnable one-dimensional splines with fixed domains as edge activations, a structure naturally suited to discretization and efficient LUT mapping. We present the first systematic design flow for implementing KANs on FPGAs, co-optimizing training with quantization and pruning to enable compact, high-throughput, and low-latency KAN architectures. Our results demonstrate up to a 2700x speedup and orders of magnitude resource savings compared to prior KAN-on-FPGA approaches. Moreover, KANELÉ matches or surpasses other LUT-based architectures on widely used benchmarks, particularly for tasks involving symbolic or physical formulas, while balancing resource usage across FPGA hardware. Finally, we showcase the versatility of the framework by extending it to real-time, power-efficient control systems.
comment: International Symposium on Field-Programmable Gate Arrays 2026 (ISFPGA'2026)
Cyber Resilience of Three-phase Unbalanced Distribution System Restoration under Sparse Adversarial Attack on Load Forecasting
System restoration is critical for power system resilience, nonetheless, its growing reliance on artificial intelligence (AI)-based load forecasting introduces significant cybersecurity risks. Inaccurate forecasts can lead to infeasible planning, voltage and frequency violations, and unsuccessful recovery of de-energized segments, yet the resilience of restoration processes to such attacks remains largely unexplored. This paper addresses this gap by quantifying how adversarially manipulated forecasts impact restoration feasibility and grid security. We develop a gradient-based sparse adversarial attack that strategically perturbs the most influential spatiotemporal inputs, exposing vulnerabilities in forecasting models while maintaining stealth. We further create a restoration-aware validation framework that embeds these compromised forecasts into a sequential restoration model and evaluates operational feasibility using an unbalanced three-phase optimal power flow formulation. Simulation results show that the proposed approach is more efficient and stealthier than baseline attacks. It reveals system-level failures, such as voltage and power ramping violations that prevent the restoration of critical loads. These findings provide actionable insights for designing cybersecurity-aware restoration planning frameworks.
comment: 10 pages, 7 figures
Robotics
CrossMaps: Confidence-Aware Open-Vocabulary Semantic Mapping for Rover Navigation ICRA
Rovers rely on perception to maintain spatial maps that encode both objects and sensor quality (e.g., range reliability, lighting artifacts, data density), guiding data fusion, embedding updates, and navigation under partial observability. To study these coupled perception-navigation processes, we present CrossMaps, a real-time confidence-aware open-vocabulary semantic mapping pipeline that constructs language-queryable maps from RGB-D data. Building on VLMaps-style approaches, CrossMaps integrates multi-scale CLIP embeddings with confidence-aware fusion and a dual-memory architecture consisting of Short-Term Memory (STM) and Long-Term Memory (LTM). The STM aggregates noisy visual observations using geometric, semantic, and temporal confidence cues, while confident and coherent cells are promoted to the LTM as persistent semantic landmarks. Designed for deployment with a Jetson Orin-powered UGV alongside SLAM, CrossMaps runs in real time and produces semantic heatmaps that can be queried with natural language to guide rover navigation.
comment: IEEE International Conference on Robotics and Automation (ICRA) 2026: ROSE International Workshop on Robotics Software Engineering, June 01, 2026, Vienna, Austria
Unified Motion-Action Modeling for Heterogeneous Robot Learning
We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.
comment: https://uma-manipulation.github.io/
Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models
This work addresses spatial question answering for service robots traversing long egocentric routes. Given a query such as "where can I find a dry cleaner on the way back home?", the system returns a metric coordinate that downstream navigation components can act on. Prior Spatial Question Answering approaches leverage retrieval-augmented agents built on closed-source models such as GPT-4o for path exploration. However, robots operating in the real world often cannot reliably depend on online closed-source models due to network instability, communication latency, and deployment cost. It creates a need for open-source based Spatial Question Answering approaches that can run onboard the robot, yet prior research in this direction remains limited. This work proposes BinTrack, a simple yet effective, fully open-source spatial-localization agent that leverages the temporal ordering of a robot's trajectory. BinTrack performs a binary search over the trajectory segments between two anchor landmarks identified from a query. It improves overall accuracy by up to 22.8% over other open-source implementations and even matches the reported closed-source model result on the global category of the SpaceLocQA benchmark, the most challenging setting that has so far required strong reasoning agents such as GPT-4o. Furthermore, its optimized inference strategy consistently yields more than a 1.5x inference speedup over previous approaches. Finally, this work releases GangnamLoop, a novel and practical multi-trip outdoor benchmark collected by deploying a real quadruped robot on public streets with the anonymization policy. It revisits the same locations under different outdoor conditions and pairs the robot's low viewpoint with the human owner's. The source codes and datasets are publicly available at https://github.com/ndb796/BinaryTracking
comment: 21 pages, 4 figures, 15 tables. Project page: https://ndb796.github.io/BinaryTracking ; Code and dataset: https://github.com/ndb796/BinaryTracking
LOPAL: Local Performance-Aware Active Learning from Imperfect Demonstrations
Learning from Demonstration (LfD) enables intuitive robot skill acquisition by allowing robots to learn directly from human task demonstrations. However, current methods often fail to address the fact that due to suboptimal and inconsistent human behavior, the quality of the demonstration can vary within each demonstration. Therefore, we introduce LOPAL (LOcal Performance-aware Active Learning), an active learning approach that leverages this local demonstration quality information. Our approach consists of two synergistic components. First, a local performance-driven LfD method uses a Gaussian Mixture Model (GMM) to encode both the demonstrated trajectories and their associated local quality assessments. This enables the generation of trajectories that outperform the imperfect demonstrations by utilizing complementary local data of high performance. Second, active data acquisition allows to improve beyond the imperfect demonstrations by collecting additional informative samples. In areas missing good data, the user is actively requested to provide corrections through a shared autonomy (SA) mechanism, while the robot autonomously executes the learned behavior. The efficacy of LOPAL was validated in both a simulation and a real-world experiment. The results from a real-world pipe inspection task showed that the proposed approach can achieve up to 27.31 % improvement in task performance while also reducing the effort required to collect the demonstrations.
comment: Accepted for publication in IEEE Robotics and Automation Letters (RAL), 2026
SGM-SLAM: Scene Graph Matching for Data-Efficient Distributed SLAM
We introduce a data-efficient distributed Simultaneous Localization and Mapping (SLAM) framework designed for a team of robots equipped with LiDAR, cameras, and inertial sensors. Our framework uses scene graph matching to identify inter-robot measurement constraints. Unlike prior approaches that rely on feature-level matching, our framework is the first to perform scene graph matching using only object labels and centroids. Our approach constructs a scene graph by using fused RGB-LiDAR point clouds to generate both a semantically segmented point cloud layer, and a layer of discrete bounded objects, to accompany estimated robot trajectories. Scene graph matching is performed collaboratively through exchanging and matching object data with neighboring robots. To maximize communication efficiency, we utilize a multi-step data exchange and optimization process. We demonstrate the effectiveness and efficiency of our approach using both simulation and real-world datasets collected by legged robots in indoor and outdoor environments.
ExoTraj: A General Lower-limb Exoskeleton Assistance Policy for Complex Environments
Adaptive torque prediction in dynamic exoskeleton scenarios requires expensive motion capture systems, which are infeasible in complex outdoor environments. Trajectory prediction has emerged as one of the effective approaches to address such an issue. However, the core challenges of exoskeleton trajectory prediction are twofold: establishing the mapping from multi-modal features to trajectory information; constructing the mapping from trajectory to torque. For the former, most existing methods perform only single-step prediction and neglect inter-subject trajectory variability, thereby limiting the trajectory optimization space and prediction generalization. To address this, this paper proposes a fast flow matching method that enables accurate trajectory prediction and better generalization for real-time performance, where trajectory generation errors and encoded observations are used to guide the training direction. For the second challenge, due to the high dynamics of the human-robot system and the strong coupling between perception and control, simple control methods struggle to achieve efficient assistance based on the predicted trajectory. This paper utilizes model predictive control and designs a novel optimization objective to optimize torque, ensuring the exoskeleton achieves comfortable and robust assistance. By integrating the above two components, the unified policy, denoted as ExoTraj, is developed to enable adaptive assistance in complex outdoor scenarios without high data acquisition cost. Experimental results show that compared to traditional methods, ExoTraj reduces cross-subject prediction error by 14.0% during the online phase and maintains robustness against external noise. Relative to the zero torque condition, ExoTraj decreases metabolic rate by 11.5-24.4%, heart rate by 1.7-19.5%, and peak muscle activation levels by 10.9-41.3%, respectively.
comment: 28 pages, 19 figures, project page: https://xiaoyinliu0714.github.io/Home_ExoTraj/
Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning ICML 2026
Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL (PbRL) offers a promising alternative by learning reward functions from human feedback, but its scalability is hindered by high labeling costs. Inspired by advances in Video Foundation Models (ViFMs), we present Video-based Optimal Transport Preference (VOTP), a semi-supervised framework that learns effective reward functions from only a handful of labels. By leveraging optimal transport to align visual trajectories within the rich representation space of ViFMs, VOTP effectively generates high-fidelity pseudo-labels for large amounts of unlabeled data, substantially reducing human supervision. Extensive experiments across locomotion and manipulation benchmarks demonstrate the superiority of VOTP, which outperforms state-of-the-art offline PbRL methods under limited feedback budgets. We also showcase the robustness of VOTP in the presence of visual distractors and validate its utility on real robotic tasks, where it learns meaningful rewards with minimal human input.
comment: ICML 2026 (Oral)
ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies
Generalist manipulation policies are increasingly presented as foundation models for robotic control, but their real-world generalization remains difficult to diagnose. A policy may succeed on demonstrated tasks while still failing to execute fine-grained atomic skills or recombine learned skills in new task structures. We introduce \textbf{ATOM-Bench}, a real-world benchmark for evaluating both atomic skills and compositional generalization in manipulation policies. ATOM-Bench factorizes tabletop manipulation into motor atoms and instruction atoms, and contains 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. We collect 3,000 human demonstrations for atomic fine-tuning and release both the demonstration data and evaluation rollout data to support reproducible real-world evaluation. Policies are fine-tuned on atomic tasks and evaluated on both atomic skill acquisition and held-out compositional tasks. We further introduce Atomic Score (AS) and Compositional Failure Share (CFS) to distinguish failures caused by weak atomic skills from failures caused by limited compositional reuse. Through 2,700 physical rollouts on five representative manipulation policies, we find that current policies can acquire simple instruction-grounding skills, but still struggle with fine-grained motor atoms, counting, and logical filtering. More importantly, strong atomic performance does not reliably transfer to held-out compositional tasks. ATOM-Bench provides a diagnostic testbed for studying whether failures arise from weak motor execution, poor instruction grounding, or limited compositional reuse.
comment: Homepage: https://flageval-baai.github.io/AtomBenchPage
SoK: Security and Privacy of Foundation-Model-Powered Robots
Foundation models are reshaping robotics by enabling robots to interpret open-ended instructions, reason over multimodal contexts, and operate in complex, open-world environments. However, their integration also introduces security and privacy (S&P) risks that extend beyond the FMs themselves to embodied execution pipelines, supporting ecosystems, and broader governance impacts. Existing literature reviews provide valuable insights but often focus on specific FM types, risk categories, mitigation strategies, or trust boundaries. Consequently, the field lacks a unified structure for analyzing where risks originate, how they propagate across robotic systems, and where mitigations should intervene. To address this gap, we propose a progressive F-E-S-G structural boundary framework for analyzing the S&P of FM-powered robots. The framework comprises four layers: the Foundation model layer (F), Embodied system layer (E), Supporting ecosystem layer (S), and Governance impact layer (G). Building on this structure, we develop a multi-level taxonomy that organizes prior studies along three levels: F-E-S-G trust boundary, security-privacy concerns, and risk-mitigation perspectives. We further annotate each study using fine-grained coding attributes, including target, lifecycle stage, mechanism, system access, and effect. Guided by this framework and taxonomy, we systematize 96 papers. Our analysis uncovers multiple threat patterns, defense mismatches, and evaluation gaps that are difficult to identify from a single-boundary perspective. Based on these findings, we identify open challenges and future directions to provide a research agenda for developing secure, privacy-preserving, and responsibly governed FM-powered robotic systems.
comment: 21 pages, 2 figures
DIFF-IPPO: Diffusion-Based Informative Path Planning with Open-Vocabulary Belief Maps
Exploration and object search require robots to perceive their environment, identify regions of interest, and plan trajectories that improve target-detection likelihood or maximize information gain. Many IPP methods, especially in continuous environmental monitoring, rely on Gaussian-process belief models, while object-search settings often produce complex, multimodal belief maps from semantic or open-vocabulary perception. Global trajectory generation directly conditioned on such non-Gaussian belief maps remains comparatively underexplored. Although diffusion-based planners offer strong capabilities for modeling such distributions, their use in informative path planning remains limited. In this work, we propose DIFF-IPPO, a pipeline that integrates an open-vocabulary belief map generator with a diffusion-based planner for global trajectory generation over belief maps. The method generates trajectories that concentrate sensor coverage over high-belief regions, achieving normalized detection scores between 81.49% and 86.55% across different dataset scenarios. We validate the system in a simulated search-and-rescue scenario where the planner searches candidate building regions to locate a burning building. In this setting, a team of five drones using batched belief-map-conditioned trajectory generation achieves first detections in 3.5 minutes.
DataLadder: A Simulation-Enabled Interconversion Toolchain for the Embodied Data Pyramid
Generalist robot policies require trustworthy evaluation and robot-usable training data, but both are difficult to scale with physical robots alone. Real-robot trials and demonstrations remain the most faithful source of deployment signals, yet they are slow, costly, and hard to reproduce. We present DataLadder, a simulation-enabled interconversion toolchain for human-robot aligned model evaluation and data generation, denoted as Robot $\rightleftharpoons$ Simulation $\rightleftharpoons$ Human. On the one hand, the Robot $\rightarrow$ Simulation $\rightarrow$ Human pathway supports human-robot aligned model evaluation by reconstructing real-robot tabletop organization tasks as calibrated digital twins for scalable evaluation, while using human embodied feedback to inspect and refine the naturalness of simulated motions. On the other hand, the Human $\rightarrow$ Simulation $\rightarrow$ Robot pathway supports human-robot aligned data generation: it lifts ego-centric human demonstrations into simulation, checks them under robot physical constraints, and converts them into robot-centered trajectories, annotations, and visual observations. Together, these pathways use the JoySim simulator as both a scalable evaluation layer and a physical consistency filter for robot data generation. We further package the core reconstruction, simulation, rendering, and realism-augmentation modules as cloud services on JD Cloud, turning the system into reusable infrastructure for robot data generation and model evaluation.
comment: Project Page: https://joyai-sim.github.io/
Pride and Prejudice: Toward an Information-Theoretic Framework for Mutually Communicative Driver Behavior Modeling
Mixed autonomy driving becomes unsafe and inefficient when autonomous vehicles (AVs) and human-driven vehicles (HVs) misread each other's intentions. We study this problem as implicit mutual communication in lane changes. The proposed framework models how the ego vehicle both expresses its intent and probes the other driver's preference under epistemic uncertainty. It combines a level-k Bayesian persuasion game with virtual features for proactive signaling, information-theoretic rewards for mutual communication, and adaptive weights of communication affordances. We further introduce the Pride-Inquiry (P-I) and Pride-Prejudice (P-P) planes to analyze communication intensity and tendency. The model is calibrated with a Communication-Based Multi-Agent Inverse Reinforcement Learning algorithm (C-MIRL) on the naturalistic NGSIM dataset. Compared with the non-communicative baseline, the proposed model reduces the prediction error of mandatory lane changes by up to 20% while maintaining strong generalization. Driver-In-the-Loop questionnaire scores are positively correlated with the calibrated communication variables, supporting the subjective validity of the model. The learned rewards further show that inquiry and listening affordances contribute more than pride and expression alone, and that inquiry preference varies more strongly across drivers. These results support explicit modeling of mutual communication and epistemic uncertainty in interactive driving.
comment: 16 pages, 10 figures. Accepted for the IEEE Transactions on Intelligent Transportation Systems (T-ITS), June 2026
VENOM: Versatile Embodied Network for Omni-bodied Motion tracking
Achieving expert-level expressive full-body motion tracking across multiple humanoids solely from demonstration data remains a challenging and relatively an underexplored problem in humanoid robot learning. Cross-embodiment motion tracking policies are mostly trained by decoupling the control problem into upper and lower body control. This work proposes VENOM, a cross-embodiment full-body motion tracking model for humanoids in simulation. VENOM is a GPT-based motion tracker trained on multiple humanoid data that can track the entire body without the requirement to split into upper and lower body control. We curate a multi-humanoid motion tracking dataset called the VENOM dataset that contains states, actions, and rewards and train VENOM and the baselines on this dataset. In this letter, we evaluate VENOM's performance against baselines and show that we can achieve a stable motion tracker across different humanoids more capable than an MLP trained on multiple humanoid data with supervised learning alone, and also show that despite lack of reward feedback, VENOM closely matches the tracking capability of experts that were trained using asymmetric-actor critic reinforcement learning.
PATCH: Action-Chunk-Conditioned Latent Patch Innovation Monitoring for Robot Manipulation
Learning-based manipulation policies have made substantial progress in real-world robot manipulation, particularly for short-horizon action generation. However, deployment in open workspaces remains fragile under unexpected local scene dynamics, such as moving objects, transient occlusions, or disturbances near the intended motion. Existing runtime monitors often rely on global observation anomalies, policy uncertainty, or frame-level visual changes, and struggle to distinguish task-relevant execution risk from benign visual variation. We introduce PATCH, an action-chunk-conditioned latent patch innovation monitor for deployment-time intervention. Given the active action chunk, PATCH defines a projected execution corridor, predicts latent patch evolution inside it, and accumulates persistent residuals unexplained by the robot's own motion. These residuals form a localized intervention signal that allows PATCH-Router to pause execution, select an available recovery source, and resume the original policy once localized innovation subsides. Experiments on real robot rollout data show that PATCH produces more stable and context-relevant triggers than competing runtime monitors. Real-robot deployment further demonstrates monitor-driven intervention and policy resumption for disturbance-aware manipulation. Project Page: https://yananzhou5555.github.io/PATCH/.
Towards mm-Level Accurate UWB Radar: High-Accuracy Phase-Based Obstacle Detection through Multi-Channel Fusion
Accurate, tag-free distance estimation with ultrawideband (UWB) radar is essential for applications such as autonomous guided vehicles, robotics, and environment characterization. For tag-based localization systems, phase-based UWB signal processing techniques have demonstrated sub-wavelength ranging precision, but these approaches are not applicable for passive (tagless) radar setups with weak reflections, mixed multipath conditions, and the absence of a known time-of-flight (ToF) first-path reference. This paper demonstrates for the first time that phase information can be effectively exploited in a fully passive UWB radar setting. We introduce a signal processing framework that extracts reliable distance information by combining coarse amplitude-based estimates with high-resolution phase changes across multiple frequency channels. By referencing phase measurements with the line-of-sight component, the method compensates for hardware-induced phase drift, while the use of multichannel frequency diversity enables disambiguation of periodic phase information and improves robustness against frequencyspecific channel degradation such as Fresnel zones. The proposed approach is validated on a robot equipped with a bistatic UWB radar using DW3000 devices and evaluated in a realistic metallic industrial environment. Experimental results show that our work consistently achieves centimeter-level accuracy even at high speeds, with a median error of 1.69 cm, significantly outperforming existing ~10cm accuracy UWB radar approaches relying only on amplitude-information. We further show how multi-channel fusion exploits uncorrelated channel degradation to reduce the error by more than 40% compared to single-channel operation, and outline how phase modeling and fusion can be pushed toward sub-centimeter accuracy.
comment: 13 pages, Submitted to IEEE Transactions On Wireless Communications
Reinforcement Learning with Inner-loop Dynamics Estimator for Aerial Manipulation under Uncertainty
Aerial manipulators enable physical interaction in hard-to-reach environments; however, the combined problem of direct whole-body aerial manipulation under rapid arm motion, payload changes, and related unknown dynamic uncertainty remains a largely unsolved problem. We present a hierarchical control framework that combines Reinforcement Learning (RL) with an inner-loop dynamics estimator to address this problem. The RL outer loop maps desired 6-degrees-of-freedom (DOF) end-effector targets to coordinated whole-body commands, enabling direct task-driven control without relying on a fully accurate coupled dynamic model in the policy layer. An inner loop then tracks these commands while compensating for transient inertial shifts and uncertainty during execution via a dynamics estimator scheme without requiring system model knowledge. We validate the proposed approach on a custom quadrotor equipped with a 3-DoF manipulator through hardware experiments under varying payload conditions. Compared with RL+PID and RL+INDI+PID baselines, the proposed method reduces end-effector tracking error and improves task success rate across the tested hardware conditions. These results show that combining learned whole-body coordination with estimator-based low-level compensation improves the precision and robustness of aerial manipulation under changing operating conditions.
WaveSync: Constrained Wavefront Optimization for Synchronized Co-Speech Gestures in Humanoid Robots
Expressive co-speech gestures are crucial for natural human-robot interaction, but generating them on physical humanoid robots is difficult because gesture strokes must align with speech emphasis while satisfying strict kinematic and dynamic constraints. Unlike virtual avatars, humanoid robots cannot freely execute rapid or overlapping motions, making word-level synchronization and hardware-safe motion planning a coupled problem. We present \textbf{WaveSync}, a hybrid framework in which a Large Language Model decomposes dialogue responses into structured semantic schemas and assigns per-word importance weights, constructing a continuous Semantic Importance Wave. Gesture trajectories are shaped through Dynamic Movement Primitives, enforcing kinematic feasibility while enhancing expressiveness. A Wavefront Optimization stage aligns peak-to-peak gesture-speech synchronization and resolves residual kinematic violations through gesture-duration compression and forward propagation. Experimental evaluation based on five dialogue scenarios shows that our method achieves high synchronization accuracy and outperforms three baselines in both objective and subjective evaluations. Each component in WaveSync plays a necessary role in producing gestures that are expressive, semantically grounded, and kinematically compliant. The code, resources, and videos are available at \href{https://github.com/pairs-lab/WaveSync}{WaveSync}
Steering Generative Reinforcement Learning into Stable Robotic Controller
Diffusion and flow-based generative policies provide a powerful policy class for reinforcement learning by inducing rich stochastic exploration through iterative action generation. However, the stochasticity of diffusion policies is not suitable for stable and precise control in high-dimensional robotic systems, where small action variations can accumulate into inconsistent motion and reduced robustness. To address this issue, we propose SteerGenPO, a latent-space reinforcement learning framework that steers a trained generative policy into a robust deterministic robotic controller. The key idea is to replace stochastic latent sampling of the trained generative policy with a learned latent actor that predicts a state-dependent latent input for the generative policies. This separates exploration and control: stochastic generative sampling provides diverse action proposals during policy learning, while deterministic latent steering provides stable and adaptive control at deployment. We evaluate SteerGenPO on six Isaac Lab benchmarks and a Unitree G1 locomotion task. The results show SteerGenPO improves over both classical RL and generative RL baselines, while its deterministic latent steering produces more stable inference-time behaviors and more reliable command responses.
Automated Digital Twin Construction for Highway Scenarios Using LiDAR Point Clouds and OpenStreetMap
Accurate road environment modeling is fundamental to the simulation and validation of automated driving systems. However, constructing road maps in standardized formats such as ASAM OpenDRIVE from real-world sensor data remains a time-consuming and costly process. Mobile mapping LiDAR captures accurate lane-level geometry but is confined to the driven corridor, while OpenStreetMap (OSM) provides broad road network topology but lacks geometric precision at the lane level. To address this, an automated workflow is proposed to fuse LiDAR point clouds with OSM data to generate georeferenced ASAM OpenDRIVE maps of highway environments, requiring minimal manual intervention. The pipeline reconstructs mainline roads from LiDAR-derived measurements and infers ramp geometry and topology from the OSM road graph, enabling complete highway interchange modeling without full sensor coverage. Experiments demonstrate a mean lateral RMSE of 0.740 m, and the generated maps are directly usable in mainstream simulation platforms including IPG CarMaker and Esmini. These results validate the effectiveness of combining measurement-derived geometry with map-derived topology for automated OpenDRIVE digital twin generation. The project code is available at https://github.com/ftgTUGraz/opendrive-digital-twin-generator
comment: 9 pages, 5 figures
PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models
Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.
comment: Project page: https://rckola.github.io/prose/
Elastic ODYN: Differentiable Optimization for Infeasible Control and Learning in Robotics
Robotic systems routinely encounter conflicting objectives, modeling errors, and degenerate contact conditions that render quadratic programs (QPs) infeasible. Yet most optimization solvers and differentiable QP layers assume feasibility, leading to numerical failures, unstable gradients, or solver breakdown when constraints cannot be simultaneously satisfied. We present Elastic ODYN, a primal--dual non-interior-point QP solver that handles infeasibility through smooth squared-$\ell_2$ elastic relaxations. The resulting formulation remains well posed under ill-conditioning and degeneracy, supports warm starting, and converges to closest-to-feasible solutions when no feasible point exists. A lightweight refinement stage recovers physically meaningful dual variables from the elastic solution. Building on this framework, we develop Elastic OdynLayer, a differentiable QP layer with stable gradients under infeasibility, and Elastic OdynSQP, an infeasibility-aware SQP method that resolves inconsistent subproblems and intrinsically infeasible optimal control tasks through selective constraint relaxation. We evaluate the framework on benchmark QPs, singular contact mechanics, differentiable parameter identification, and quadrupedal and humanoid trajectory optimization. Across all settings, Elastic ODYN consistently outperforms state-of-the-art elastic QP solvers in robustness, warm-start performance, and convergence reliability, enabling optimization, simulation, control, and learning beyond the feasibility assumptions of existing methods.
comment: 8 pages, 5 figures, 2 tables
ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning SC
Roundabouts challenge automated driving in mixed traffic, as heterogeneous and non-deterministic human behavior, unknown driving intentions, and high interaction complexity create uncertainty about whether the conflict zone will be blocked or available at the moment of entry. We present ROSA-RL -- uncertainty-aware Roundabout Optimized Speed Advisory with Reinforcement Learning. It enables safe and efficient roundabout entry for automated and human-driven vehicles in mixed traffic through probabilistic conflict forecasting. A Transformer-based model predicts conflict zone occupancy over a five-second horizon, capturing multi-agent interactions to anticipate upcoming conflicts and available gaps. The prediction outputs encode uncertainty in future motion and intent, and augment the state of a classical RL framework, enabling uncertainty-aware speed coordination. Evaluated in simulations grounded in real-world data, ROSA-RL can effectively handle uncertainty and outperform a comparable model-based baseline, closing the gap to an ideal setting assuming fully known occupancy while improving traffic efficiency and safety. The source code of this work is available under: github.com/urbanAIthi/ROSA-RL.
comment: 8 pages, 2 figures, 2 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version
ADAPT: Analytical Disturbance-Aware Policy Training for Humanoid Locomotion
Humanoids deployed in human-centered environments must handle force-interactive tasks, where external contacts introduce unexpected disturbances that disrupt locomotion accuracy and stability. Existing learning-based approaches rely on broad domain randomization, task-specific force objectives, or learning-based force estimators from motion history, each of which compromises accuracy, task transferability, or out-of-distribution (OOD) robustness. We present Analytical Disturbance-Aware Policy Training (ADAPT), a framework that equips humanoid policies with a physically grounded disturbance observer. The core of ADAPT is an analytical whole-body disturbance observer that estimates residual force/torque online with the accessible robot dynamics, without requiring force/torque sensors. Fed directly into the policy, the estimated disturbances give the humanoid an explicit, physics-derived sense of external force/torque that can generalize across diverse unseen scenes. Experiments on a Unitree G1 humanoid show that ADAPT achieves accurate disturbance prediction and stronger robustness than a proprioception-only baseline under torso perturbations, standing pushes, and asymmetric hand payloads, with improved velocity tracking even on OOD disturbances. Moreover, ADAPT enables penalizing inferred disturbances at lower-body joints to encourage lighter locomotion.
Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning ICML 2026
Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal -- a signal that is geometrically uninformative when the goal is far from the data distribution. We propose Direction-Conditioned Policies (DCP), a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation $ψ$: a subgoal-scoring step that selects a visited state $z_t$ aligned with the final goal $g$ in $ψ_g$, and a direction-conditioned actor that consumes the unit direction $d_t$ and magnitude $r_t$ from $ψ(s_t)$ to $ψ(z_t)$. The two components train jointly, factor cleanly at deployment (subgoal scoring is removed, while direction conditioning remains with $g$ in place of $z_t$), and admit independent modification at the same $(d_t,r_t)$ interface. We prove three results. First, direction sufficiency under HJB: the optimal action under control-affine dynamics depends on the goal only through the value gradient. Second, a quantitative bound showing that, under mild conditions on the learned representation and assuming the scoring rule returns an on-path $z_t$, the actor's conditioning input at training and at deployment coincide up to representation error and geodesic slack. Third, a controllable-subspace characterization of when directional conditioning fails. Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks; a qualitative analysis of the learned $ψ$-distance landscape shows the contrastive representation behaves as an online quasimetric encoding environment topology, and the single failure case (AntSoccer) localizes to a learned-gradient pathology that the theory anticipates.
comment: 17 pages, Accepted to the 2nd Workshop on Compositional Learning at ICML 2026 (Seoul, South Korea)
Agile Fall Recovery for Quadrotors with Bidirectional Thrust via Reinforcement Learning
Autonomous fall recovery is a critical capability for quadrotors operating in real-world environments, where collisions or failures may leave the vehicle resting on the ground in an arbitrary attitude. This problem is challenging because recovery must be achieved under limited onboard sensing, in constrained free space, with ground contact, and in the presence of unknown disturbances. In this letter, we present an RL-based framework for autonomous fall recovery of a quadrotor from arbitrary ground attitudes to stable hover using only lightweight onboard sensors. To address severe partial observability and intermittent sensor invalidity, we train a recurrent policy within an asymmetric actor--critic architecture, leveraging an Incremental Nonlinear Dynamic Inversion (INDI) controller to track the policy output. Combined with high-fidelity simulations of motor response and optical flow, the overall training framework significantly reduces the sim-to-real gap. Simulation ablation studies validate the importance of the main design choices, while real-world experiments demonstrate zero-shot transfer and robust recovery under different initial attitudes, wind disturbances, and additional payloads. These results demonstrate that agile quadrotor fall recovery can be achieved without explicit state estimation using only limited and unreliable onboard sensing.
APEX: Adaptive Policy Execution for Precise Manipulation
Modern imitation learning methods, including visuomotor and Vision-Language-Action (VLA) policies, typically output high-level action references that are executed by low-level controllers. However, the absence of higher-order reference signals, together with the policy's lack of awareness of the underlying low-level control dynamics during training, inevitably induces an execution gap. As a result, realized actions deviate systematically from policy-commanded ones, with a critical impact on precision-sensitive manipulation. Prior work either modifies the policy architecture or the low-level controller, both requiring intrusive changes to the pretrained policy or packaged controller. This raises a natural question: when the policy and controller are both treated as inaccessible black boxes, can we bridge the execution gap? We propose Adaptive Policy Execution (APEX), a plug-and-play framework inserted between the policy and the controller that reconstructs a dynamically feasible reference from policy outputs and adapts at test-time according to low-level state feedback, with a provable convergence guarantee. Extensive empirical studies show that APEX reduces controller-induced tracking error by 41.2% on demonstration replay and improves manipulation success by 4.8--25.8 percentage points across four visuomotor and VLA policy classes.
comment: 20 pages, 9 figures, 4 tables
HATS: A Human-Agent Teleoperation System for Multi-Arm Data Collection
Many real-world manipulation scenarios, such as handling complex collaborative tasks and dealing with large workspaces, require coordination of more than two robotic arms. Consequently, an effective multi-arm teleoperation system is required to collect demonstrations for training coordinated multi-arm manipulation policies. However, existing teleoperation frameworks mainly focus on single-operator or multi-operator setups, facing a practical trade-off between the cognitive load placed on a single operator and the coordination cost incurred by multiple operators. To address this problem, we introduce HATS, a human-agent teleoperation system that enables a single human operator, assisted by an MLLM-based agent, to collect data for multi-arm manipulation tasks. Our system decouples the control space: two primary arms are directly teleoperated by the human, while two assistive arms are controlled by a training-free agent that handles sub-tasks. In addition, the human operator can use voice commands to prevent collisions and correct assistive arm behaviors during execution. Extensive evaluations demonstrate that HATS achieves data collection efficiency and success rates comparable to expert dual-human teams. Moreover, downstream policy evaluations demonstrate the efficacy and quality of the data collected through HATS.
Robots that Collaborate: Sequential Asymmetric Imitation for Learning Coupled Robot Policies
Collaborative mobile manipulation requires robots to coordinate with a partially observed partner while physically interacting through shared objects. This is difficult because failures often arise not from poor local skills, but from mistimed waiting, yielding, pulling, releasing, or repositioning. We study this problem with two bimanual mobile manipulators coupled through rigid and deformable objects. We propose Sequential Asymmetric Imitation (SAI), a single-teleoperator curriculum for learning coupled multi-robot behaviors without synchronized dual-operator demonstrations or explicit inter-robot communication. SAI trains Robot A from unilateral demonstrations with a compliant human partner, trains Robot B against the deployed Robot A policy, and then refines Robot A using sparse interventions near coordination failures. This staged process exposes the policies to increasingly realistic partner behaviors, including delay, phase mismatch,insufficient yielding, and interaction conflict. Across real-world dual-robot manipulation tasks, SAI improves task success, phase synchronization, and partner-contingent yielding over independent imitation and curriculum-ablation baselines. These results suggest that physically coupled collaboration can be learned through the structure of the imitation curriculum, rather than through synchronized multi-operator demonstrations or explicit coordination mechanisms.Project page:http://cyc0429.github.io/sai-project-page/
HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization
Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.
MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry
Monocular visual odometry (MVO) is foundational to autonomous navigation and robotic localization. However, existing learning-based MVO approaches often struggle with either a lack of interpretable, complementary features or overly complex multi-stage architectures. These limitations inherently restrict their robustness and cross-domain generalization. In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions. Extensive evaluations demonstrate that, without any target-domain fine-tuning, MVOFormer achieves superior zero-shot generalization and robustness, significantly outperforming prior learning-based frame-to-frame methods across diverse benchmarks including TartanAir, KITTI, TUM-RGBD, and ETH3D-SLAM.
comment: 8 pages, 6 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)
Decoupled Object-Centric Video Understanding for Generating Robotic Manipulation Commands
Translating video demonstrations into executable robot commands remains challenging because existing methods often fail to identify which objects are functionally involved in the demonstrated action. As a result, they may generate commands that are linguistically plausible but operationally ambiguous. We propose an object-centric video understanding framework that decouples action recognition from object identification to generate precise, grammar-free manipulation commands. Our approach integrates Temporal Shift Modules (TSM) for efficient spatio-temporal action classification with a novel \textbf{Object Selection} algorithm that identifies task-relevant objects through trajectory-based role classification, blur detection, and overlap minimization. The selected objects are then processed by Vision-Language Models (VLMs) for robust category recognition and zero-shot generalization. Evaluated on a modified Something-Something V2 dataset, our method achieves 86.79\% action classification accuracy and BLEU-4 scores of 0.337 on standard objects and 0.261 on novel objects. These results improve over the strongest task-specific baseline by 80.2\% and 143.9\%, respectively. Larger gains are observed in METEOR and CIDEr, reaching 157.9\% and 171.7\% on novel objects. Across all semantic metrics, our approach consistently outperforms task-specific methods and remains competitive with, or surpasses, large general-purpose VLMs while retaining a modular, object-centric design.
A Formal Resilience Framework for Cyber-Physical Embodied Systems under Device-Level Cyberattacks
In cyber-physical systems (CPSs), fault tolerance is traditionally achieved by analysing sensor and actuator outputs, detecting progressive drift or sudden failures, and initiating suitable tolerance mechanisms. Reasonable under general failure models, this approach fails to capture nuanced disruptions caused by cyberattacks, which may employ subtle strategies. This is particularly critical in embodied CPSs, where computational and physical devices not only have an active role in task completion, but also in embodiment preservation (that is, maintaining the system's physical integrity). To prevent structural physical damage, embodied CPSs require a framework that enables proactive response to cyberattacks. This paper proposes a formal dependability framework that incorporates IDS information into resilience evaluation predicates, enabling assessment of tolerance to disruption and degradation. The framework supports structured reasoning about how cyberattacks affect task execution and embodiment preservation, and whether mitigation strategies must be deployed. Analytical examples demonstrate its analytical capability and soundness, establishing a theoretical foundation for dependable and secure embodied CPSs.
comment: 8 pages, 2 tables
RHO: Your Coding Agent is Secretly a Roboticist
Code-as-Policies (CaP) has shown that large language models (LLMs) can write code to solve robotics tasks by composing perception, planning, and control primitives. Recent CaP systems, however, rely on multi-turn code-generation loops at test time, which is often infeasible for real-time robot control. We introduce Robotics Harness Optimization (RHO), a novel paradigm in which tool-enabled coding agents, at training time, propose and search for interpretable, neurosymbolic multi-file policy repositories (Repositories-as-Policies) that compose these primitives rather than a single prompt, function, or file. RHO searches with reflective feedback from environment reward and execution rather than teleoperation demonstrations. It generalizes to perturbed pick-and-place settings like LIBERO-PRO, where OpenVLA scores 0.0% and $π_{0.5}$ averages 12.83%. Using the same low-level primitives, RHO reaches a 45.0% success rate, 2.5x higher than the strongest multi-turn agentic system, and 3.5x higher than $π_{0.5}$. On Robosuite, RHO sets a new state-of-the-art of 70.0%, exceeding the prior multi-turn record of 68.29% using single-turn execution with no corrective LLM code edits at deployment. When an LLM is used in the control loop, as on RAI's O3DE benchmark, RHO optimizes the deployed agent's multi-file harness of prompts, tools, and control code, improving held-out success from 23.5% to 44.3% with 20% less wall-clock time and 27% fewer tool calls.
comment: 46 pages, 9 figures, 15 tables. Project page: https://rho-robotics.github.io
Training and Evaluating Diffusion Policies with Long Context Lengths
Imitation learning has enabled highly-dexterous robotic manipulation from RGB observations. Policies trained with these methods, however, typically condition robot actions on only a short history of observations. These policies cannot solve tasks that require memory and can get stuck repeatedly executing the same failing motions. In this work, we first benchmark policy performance as context length is incrementally increased from short to long, across a spectrum of tasks with varying local stability and memory requirements, and in multiple data regimes. To our knowledge, this is the first study to investigate context length in imitation learning at this level of detail. Our results challenge prior claims: naively scaling context length is not as brittle as advertised in literature. With an appropriate conditioning method and denoising backbone (UNet+Cross-Attention), single-task policies achieve high success rates on many tasks in the usual data regime even with naive scaling. Next, we propose a training algorithm to jointly train policies at multiple context lengths, further reducing the sample complexity of long-context learning. Finally, we apply our findings to re-evaluate some previously proposed solutions to long-context imitation learning.
V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos
Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, an efficient framework designed to learn dexterous manipulation policies directly from human demonstration videos. We establish an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, we introduce a two-stage refinement process to enforce spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Ultimately, experimental results confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments.
An Augmented Reality Brain-Robot Interface for Generalist Robot Arm Manipulation
The integration of augmented reality (AR) and EEG-based brain-computer interfaces (BCIs) offers a promising path for enabling intuitive control of robots for assistive purposes. However, existing AR brain-robot interface (BRI) systems are often constrained to task-specific structures, limiting their utility in real-world environments. We present an AR BRI designed for generalist robot arm manipulation that combines gaze-based object selection with motor imagery action control. Our system uses eye-tracking for intuitive object targeting and context-aware visual overlays ("Place" and "Use") to guide the user through tasks within a shared autonomy framework. We evaluated the interface through a feasibility study with 18 healthy participants performing three multi-step activities of daily living: drinking, using a drawer, and operating an oven. Our results demonstrate that this interaction paradigm enables effective sequential task execution and high user engagement, achieving a "Good" usability rating (SUS > 70). These findings support the feasibility of the proposed interaction paradigm for complex BCI-driven robotic assistance, and motivate future evaluation with the intended target population. Project website: https://ar-bri-manip.github.io/.
comment: Accepted at the 2026 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
SemGeoNav:A Safety-Guided Visual Navigation Approach with Semantic Reasoning and Geometric Planning
Learning-based visual navigation has enhanced semantic goal-reaching capabilities. However, due to their black-box nature, purely end-to-end models often lack explicit geometric constraints, leading to unpredictable and unreliable obstacle avoidance in open environments. Conversely, traditional geometric planners ensure safety but struggle with high-dimensional visual targets. To address these limitations, we propose SemGeoNav, a novel hierarchical visual navigation framework.It tightly integrates the high-level semantic reasoning of end-to-end models with the reliable local planning ability of geometry-based methods, achieving robust image-based navigation while significantly improving obstacle avoidance. Furthermore, we introduce a temporal trajectory smoothing mechanism to ensure continuous and stable robot motion. We evaluated SemGeoNav on a Unitree Go2 quadruped robot in real-world environments. The results demonstrate that SemGeoNav outperforms existing representative methods, including ViNT and NoMaD, achieving higher success rates and shorter navigation times.
comment: The paper has been accepted by ICGNC 2026
ART-Glove: Articulated Tactile Glove for Contact-Grounded Dexterous Interaction Capture
We present ART-Glove, an articulated tactile glove designed to capture contact-grounded dexterous demonstrations while preserving human dexterity. ART-Glove makes hand-side contact geometry explicit with 16 rigid functional surfaces covering the fingers, thumb, and palm. Twenty-two anatomically aligned joints connect these surfaces and allow them to follow human hand motion during dexterous manipulation. Encoder-based sensing tracks surface motion, while dense piezoresistive tactile sensing records contact over the same surfaces. The complete system captures synchronized 22-DoF joint measurements and 2048-taxel tactile measurements at 120 Hz. We evaluate ART-Glove across experiments on motion freedom, joint sensing, tactile sensing, and contact-rich interaction capture, demonstrating its ability to preserve human dexterity while recording contact-grounded information that can support downstream dexterous robot learning.
Is Your Trajectory Displacement Safe in Long-tail?
Long-tail scenarios remain a major bottleneck for autonomous driving evaluation, even as datasets grow by orders of magnitude. Existing evaluation pipelines are rarely human-aligned, safety-aware, verifiable, and explainable at the same time: closed-loop metrics often saturate among strong planners, while unstructured human ratings can be noisy without a carefully designed protocol. We formulate planning evaluation as additional-threat detection: given a planner trajectory and an expert reference, does the planner's displacement introduce new unsafe driving behavior? We propose FluidTest, an evaluation pipeline with three components: a pairwise WebUI protocol for reliable human annotation; a taxonomy of 32 semantic threats with evidence-grounded decision graphs; and a three-agent verification system with reflection for precision and auditability. Experiments on the WOD-E2E dataset show that FluidTest produces consistent labels among trained annotators and identifies additional threats in 65% of Poutine trajectories and 51% of RAP trajectories. These results show that state-of-the-art planners can still exhibit substantial safety-relevant failures despite high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE). Additional details, guidance, and code are available at https://fluidtest.web.app.
comment: 20 pages, 15 figures
FlowMPC: Improving Flow Matching policies with World Models
Flow Matching (FM) is a powerful approach for behavior cloning in multimodal action spaces [Jiang et al., 2025], but because it is not trained to directly maximize expected return, there is still room to improve how FM policies act at test time. This work investigates whether a learned world model can improve FM policies by enabling Model Predictive Path Integral (MPPI) planning over candidate action sequences proposed by the policy. Building on TD-MPC2 [Hansen et al., 2024], I introduce FlowMPC, a framework that combines an imitation-learned FM policy with a learned world model for test-time planning in ManiSkill manipulation tasks [Tao et al., 2025]. Across PickCube and PickSingleYCB, adding the world model improved performance over the FM policy alone, with especially clear gains in end-of-episode success. These results suggest that world-model-based planning can effectively complement flow-based imitation policies without modifying the FM training objective.
TopoRetarget: Interaction-Preserving Retargeting for Dexterous Manipulation
Human hand-object demonstrations provide dense reference motions for training dexterous manipulation reinforcement learning (RL) policies through reference tracking. However, to use such demonstrations for RL policy learning, retargeting must preserve hand pose and task-relevant hand-object contact structure. Otherwise, contact and feasibility artifacts can degrade downstream RL policy performance. We introduce TopoRetarget, an interaction-preserving retargeting framework that uses a single set of parameters across diverse retargeting conditions while maintaining task-relevant hand-object interaction and adapting human demonstrations to dexterous robot hands. The method constructs a sparse interaction graph over hand and object keypoints and optimizes distance-weighted Laplacian deformation with directional consistency, kinematic constraints, and penetration handling. Evaluations show that the generated references improve both interaction fidelity and policy learning: TopoRetarget achieves the best contact precision and alignment over all baselines on the ContactPose Dataset, improves Pen-Spin training success by 40.6 percentage points over the existing baseline methods, and enables zero-shot transfer to Wuji Hand hardware on cube reorientation and pen spinning.
comment: Project page: https://toporetarget2026.github.io/TopoRetarget/
PolyMerge: Compressing 3D Gaussian Splats with Polytope Coverings for Provably Safe Resource-Constrained Navigation
Obstacle avoidance is essential for safe navigation and motion planning. Recent radiance field reconstruction methods enable object detection and modeling with high fidelity, but remain too memory- and compute-intensive for on-board perception-based path planning. To address these limitations, we propose PolyMerge to convert a large, photorealistic 3D Gaussian Splatting (3DGS) model of a scene into a lightweight representation of convex polytopes whose union provably over-approximates all obstacles in the original 3DGS model. PolyMerge tunes the polytope count to trade off conservativeness and compute cost, and integrates with control barrier functions (CBFs) to plan collision-free paths. We showcase PolyMerge in simulation and hardware experiments on a Crazyflie drone, which uses PolyMerge to compute and follow safe trajectories in real time under severe onboard compute constraints, outperforming baselines in speed while guaranteeing safety. For our code and videos, visit https://athlon76.github.io/PolyMerge-website/.
ATHENA: Accelerated Multi-Task Heterogeneous Influence Functions for Robot Data Curation
In robot imitation learning, influence functions provide a principled approach to quantify each demonstration's effect on robot task outcomes, yet scaling them to billion-parameter Vision-Language-Action (VLA) models is limited by computational and multitask bottlenecks. To this end, we propose ATHENA, an influence function framework tailored for multitask VLA data curation at a billion-parameter scale. Concretely, it leverages the Kronecker structure of linear-layer gradients to reduce projection cost, and approximates dense Hessian inversion with a rank-r Random Truncated Approximation, achieving about a 313.4x speedup in influence computation. Furthermore, ATHENA formulates global and local interactive influence to balance data curation across 50 jointly trained tasks. Extensive evaluations on RoboTwin 2.0 and real-robot deployment, covering 9.34 and 6.90 hours of demonstrations, respectively, show that ATHENA matches or exceeds full-data joint fine-tuning using only 50% of demonstrations in simulation and 66.7% of data across six real-robot tasks. Overall, ATHENA demonstrates its effectiveness for data curation in billion-parameter multitask VLA fine-tuning.
EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video
Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes the limitations of existing methods to enable controllable deformable digital twin generation from egocentric videos by distilling per-object inverse-physics solutions into a compact codebook, enabling prediction of dense spring stiffness fields for unseen objects without per-spring test-time optimization. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot, demonstrating that a digital twin initialized from a single egocentric human play video can serve as an internal world representation to aid in deformable-object planning, highlighting egocentric RGB observations as a scalable path toward real-to-sim pipelines.
comment: Project Page: https://hjhyunjinkim.github.io/EgoPhys
Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks
Many robotic tasks require short-term memory, whether it's retrieving an object that's no longer visible or turning off an appliance after a set period. Yet, most visuomotor policies trained via imitation learning rely only on immediate sensory input without using past experiences to guide decisions. We present PRISM, a transformer-based architecture for visuomotor policies to effectively use short-term memory via two key components: (i) gated attention, which filters retrieved information to suppress irrelevant details, improving performance by reducing the spurious correlations between the history and current action prediction, (ii) a hierarchical architecture that first compresses local information into compact tokens and then integrates them to capture temporally extended dependencies, improving its compute and memory footprint. Together, these mechanisms enable us to scale short-term memory in visuomotor policies for up to two minutes. To systematically evaluate memory in visuomotor control, we introduce ReMemBench -- a benchmark of eight diverse household manipulation tasks spanning four categories of short-term memory -- designed to foster general memory mechanisms rather than siloed, task-specific solutions. PRISM consistently outperforms prior works, including recurrent architectures, transformers, and their variants -- achieving an absolute improvement of 5%--12% over the strongest baseline. On the RoboCasa and LIBERO benchmarks, it achieves absolute improvements of 11%--15% over its no-memory variant and fine-tuned Vision-Language-Action baselines such as GR00T-N1-3B and OpenVLA, despite not leveraging any large-scale pretraining. Together, PRISM and ReMemBench establish a foundation for developing and evaluating short-term memory-augmented visuomotor policies that scale to long-horizon tasks. Additional materials are available at https://shahrutav.github.io/short-term-memory
comment: 14 pages, 9 Figures, 8 Tables
Distributed Safe Consensus Under Asymmetric Input and Time-Varying Output Constraints
This paper studies safe distributed consensus for single-integrator multi-agent systems over connected undirected graphs under simultaneous asymmetric actuator constraints and output safety constraints. Each agent is equipped with a continuously differentiable asymmetric actuator dynamics that maps a commanded control signal to the realized plant input while keeping the latter strictly inside a prescribed admissible interval. To address output safety, a barrier-coordinate transformation is introduced over a common time-varying safe interval, and a distributed synchronization law is designed in the transformed coordinates. The resulting controller integrates a graph-based coordination layer with an actuator-side tracking layer, thereby enabling simultaneous enforcement of input admissibility, forward invariance of the safe output set, and asymptotic synchronization. For compact admissible sets of initial conditions, it is shown that the closed-loop solution is complete, all signals remain bounded, the actuator inputs remain strictly within their asymmetric bounds, and the agent outputs remain inside the prescribed safe interval for all time. Moreover, the transformed synchronization errors converge exponentially to zero, and the original agent outputs asymptotically synchronize to a designer-selected admissible trajectory embedded in the common safe interval. Numerical simulations validate the proposed framework and demonstrate safe consensus under both asymmetric actuation bounds and time-varying output constraints.
A Deployment Case Study in Robotic Apparel Automation: Digital Twin Integration, Interoperability, and Workforce Enablement ICRA 2026
Despite steady advances in flexible automation in sectors such as electronics and automotive manufacturing, apparel automation remains challenging because fabrics are deformable and difficult to manipulate with robots. This paper presents a deployment-oriented case study of a robotic sewing system for denim manufacturing, emphasizing the system-level integration required for practical adoption. At the engineering level, a digital thread module parses DXF production drawings into process parameters and executable robot trajectories, reducing manual programming effort and enabling rapid re-targeting across sewing operations. In parallel, a digital twin of the workcell is used during pre-deployment to validate reach and clearance, refine layout and sequencing, evaluate operator access, and assess cycle-time compatibility with upstream and downstream tasks, thereby reducing commissioning risk. At deployment, the system integrates a collaborative robot with conventional sewing equipment, welding, suction fixtures, and machine-level controllers through an interoperability layer. Runtime monitoring and verification, including seam monitoring, collision checking, and trajectory-level validation, improve robustness under environmental variability, while operator-facing training and guidance tools support setup, troubleshooting, and technology adoption. Two staged factory deployments on denim shorts, covering 2D pocket operations and 3D garment-shaping seams, show that digital-twin-based validation, digital-thread-driven task generation, interoperability, runtime verification, and operator training are important for scaling robotic apparel automation.
comment: 4 pages, 3 figures, IEEE ICRA 2026 Workshop Paper
T-Rex: Tactile-Reactive Dexterous Manipulation
The ability to react dynamically to tactile signals has long been considered crucial to agile human-level dexterity. Yet contemporary learning-based Vision-Language-Action (VLA) models for robotic manipulation generally either overlook the tactile modality or are limited to encoders with static cues, due in part to the scarcity of diverse training data and standardized evaluation, architectural constraints in current VLA models, and limitations of static tactile encoders. In this paper, we push the frontier of tactile-reactive manipulation by addressing all of these limitations. We propose a large-scale, 100-hour tactile-rich dataset collected via a novel, data-efficient recipe that prioritizes elementary motor primitives. To effectively exploit naturally high-frequency touch signals without sacrificing the existing capabilities of existing VLAs, we introduce a variable-rate Mixture-of-Transformers (MoT) architecture equipped with a novel temporal tactile VQ-VAE encoder. We demonstrate the effectiveness of tactile-reactive policies on 12 manipulation tasks requiring delicate force control and deformable object manipulation, achieving over 30% higher average success rate than the strongest baseline.
comment: Project page: https://tactile-rex.github.io/
Human Universal Grasping
Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/
comment: 28 pages, 20 figures, 7 tables
Geometric Action Model for Robot Policy Learning
Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
comment: Project page: https://cvlab-kaist.github.io/Geometric-Action-Model/
Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate $g_t$ merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.
comment: Website: https://acerobotics-vla.github.io/HABC-Website
R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies
Spatial generalization is critical for imitation-learned manipulation policies, but achieving it typically requires scaling demonstrations across diverse object poses, robot configurations, and camera viewpoints. Data augmentation from a few source demonstrations offers a practical alternative to costly real-world collection. Simulation-based augmentation can create controllable variation, but requires complex environment and object setup and may introduce a sim-to-real gap. Recent real-to-real methods avoid these issues by jointly editing 3D observations and action trajectories from real demonstrations, yet they still rely on strong 3D scene parsing and geometry completion, and often produce observations tailored to 3D pointcloud policies rather than RGB-based 2D policies. We propose R2RDreamer, a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D action-observation editing while moving visual completion to 2D video space. Specifically, R2RDreamer first performs lightweight 3D augmentation by editing incomplete object pointclouds and end-effector trajectories in a shared 3D frame; it then projects the edited scene into masked image-space control videos with occlusion-aware reasoning and uses a dense-control image-to-video model to complete temporally coherent RGB observations. Experiments on spatially shifted manipulation tasks with both 2D diffusion-style policies and vision-language-action policies show that R2RDreamer improves spatial generalization from limited source demonstrations, with analyses validating the contributions of 3D editing, occlusion-aware projection, and video completion.
comment: Project page: https://r2rdreamer.github.io/
ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning
Human interventions provide crucial corrective signals for post-training Vision-Language-Action (VLA) models. However, enabling seamless humanoid interventions is a formidable systems challenge due to complex whole-body kinematics and dexterous-hand control. Consequently, the collected intervention trajectories are often suboptimal, and methods that rely on human interventions as expert supervision can absorb hesitant, inefficient, or even erroneous behaviors. To address both the system and algorithmic challenges, we propose ROVE, a reinforcement learning framework for humanoid VLA post-training with imperfect human interventions. First, ROVE introduces a human-in-the-loop pipeline capable of collecting deployment and intervention data for humanoid manipulation. Second, it utilizes Optimistic Value Estimation (OVE) to prioritize high-value behaviors from mixed-quality trajectories. To further robustify value estimation, we incorporate cross-embodiment human experience videos to provide rich supervision for long-tailed failure and recovery modes. The resulting critic yields informative advantage signals, steering the VLA actor to focus on high-value behaviors rather than indiscriminately imitating all actions. On challenging real-world contact-rich and fine-grained humanoid manipulation tasks, ROVE outperforms experience-learning baselines and consistently improves across multiple rollout-intervention iterations.
Task-Error Residual Learning for Real-Robot Five-Ball Juggling
For residual learning that refines existing behavior, sample efficiency depends on two things: how much information each rollout returns, and how efficiently the learner uses that information. Reinforcement learning's standard scalar reward carries far less information than the directional task error that defines the task. Random exploration further discards whatever information each rollout returns. Through residual learning with directional task-error supervision and a task error model that drives sample selection, we achieve stable three-, four-, and five-ball juggling on anthropomorphic Barrett WAM arms. Despite planning and controlling through a simple, idealized stack, the system converges from the second attempt. The first attempt drops, after which task error decreases monotonically without further failures. In comparison, five-ball juggling typically takes humans years of practice. We compare residual learners across two ternary axes, the directional information in the learning feedback and the commitment of the analytic prior, spanning Newton-style Jacobian updates, Composite Bayesian Optimization, and stochastic search methods. Both axes prove necessary: neither directional feedback nor an informative prior suffices alone, and the simplest method that combines them, a fixed-Jacobian Newton update, is the most reliable. The learned residual tolerates substantial prior misalignment and degraded joint tracking, affecting mainly convergence speed. The bottleneck for residual learning on real robots is therefore the information content of the supervision signal and how the learner uses it, not the accuracy of the surrounding stack. Video documentation of all experiments is available at https://kai-ploeger.com/residual-juggling.
comment: Submitted to the 2026 International Symposium on Robotics Research (ISRR)
When Should a Robot Replan? Regret-Guided Update Scheduling in Time-Varying MDPs
Robots operating in non-stationary environments must continually adapt their policies as the dynamics drift, but onboard energy and compute budgets cap how often a full state estimation and re-planning step can be performed. This raises a question: \emph{when}, along a horizon, should a robot spend its limited budget? We formulate this problem in time-varying Markov decision processes (TVMDPs) with a known bound on the rate of transition drift. We model execution as a \emph{skip-update} scheme in which, at chosen update times, the agent estimates the transition kernel by maximum likelihood and computes a finite-horizon policy, and between updates reuses this policy under a propagated state estimate. We analyze the dynamic regret of this scheme and show how it grows during skip intervals in terms of the properties of the TVMDP and the skip lengths; the resulting bound answers the opening question via an online, regret-guided update rule that allocates the budget adaptively. We evaluate the rule in a simulated Mars-rover navigation task with time-varying slip dynamics and on a Crazyflie quadrotor in indoor obstacle fields. Adaptive allocation outperforms other budgeted baselines.
SidewalkBench: Benchmarking Visual Navigation on Urban Sidewalks
Urban sidewalk navigation presents significant challenges due to complex structural layouts, dynamic pedestrian behaviors, and long distances. While recent visual navigation models offer a promising solution, the lack of a unified benchmark hinders quantitative and reproducible evaluation. To bridge this gap, we propose SidewalkBench, a comprehensive benchmark designed for visual navigation on urban sidewalks. Built upon NVIDIA Isaac Sim, SidewalkBench brings GPU-accelerated simulation of diverse, high-fidelity sidewalk environments, including both procedurally generated and real-world scanned scenes. We further populate the scenes with rich, reactive event-based pedestrian behaviors and flexible, efficient animation, enabling standardized model evaluation under realistic real-world settings. We conduct a comprehensive evaluation of 9 visual navigation models on 330 unit-test scenarios, 800 pedestrian-reactive scenarios, and 105 long-horizon scenarios. Our findings highlight that pedestrian interaction and long-horizon robustness remain critical bottlenecks for existing models, and scaling up sidewalk training with synthetic data emerges as a promising solution.
comment: Project Page: https://vail-ucla.github.io/SidewalkBench/
DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models
Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLMbased evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.
comment: Under Review
Transformer-Based Warm-Starting for Feasible and Optimal Terminal Approach to Tumbling Objects with Space Manipulators
Real-time trajectory generation for on-orbit robotic servicing is challenging due to the nonlinear coupling between spacecraft bus motion, manipulator dynamics, visibility cone, and trajectory-level safety constraints. This paper studies learning-based warm-starting for sequential convex programming (SCP) in the terminal approach of a space manipulator toward a tumbling target. The proposed framework decomposes the problem into a system center-of-mass translational planning stage and a coupled attitude--manipulator torque-allocation stage, and applies a causal transformer warm-start to the latter, which constitutes the dominant computational bottleneck. Linear and flow matching action decoders are compared under different action-chunking and training dataset sizes, and the resulting warm-starts are evaluated under both cost-optimal and feasibility projection using SCP. Across 300 held-out scenarios, the learned warm-start reduces the second-stage SCP iteration count by up to 28% and the runtime by 23% while preserving the final control-cost distribution. When the learned warm-starts are used for nonconvex feasibility projection, they nearly halve the runtime relative to cost-optimal SCP, while avoiding the catastrophic high-cost tail behavior observed when initialized heuristically. These results indicate that sequence-model warm-starts can improve both the computational efficiency and trajectory robustness of optimization-based terminal guidance for space manipulation.
comment: 8 pages, 4 figures
Abstention-Aware Personalized Object Rearrangement via Uncertainty-Guided LLM Assistance
Robotic assistance in household environments requires not only predicting where objects should be placed, but also reasoning about when objects should not be placed at all. Existing approaches to personalized object rearrangement primarily focus on placement decisions under the assumption of clean observations and complete actionability, limiting their applicability in realistic, cluttered, and partially erroneous settings. In this paper, we introduce APOLLO, a hybrid framework for abstention-aware personalized object rearrangement that combines a lightweight, personalized embedding model (PEM) with selective large language model (LLM) assistance. PEM is trained for each user-environment pair using a small number of demonstrations, operates entirely on CPU, and produces uncertainty estimates, which are used to selectively invoke LLM-based reasoning only for ambiguous decisions, balancing efficiency, privacy, and reasoning capability. To evaluate this formulation beyond existing benchmarks, we introduce APOR, a synthetic, LLM-generated dataset that captures room-level, multi-furniture environments, diverse organizational profiles, explicit abstention behavior, and noisy partial scene context. Extensive experiments on both PARSEC and APOR provide initial evidence that APOLLO improves over prior LLM-based baselines in controlled benchmark settings while substantially reducing LLM usage. Code is available at https://github.com/PaInt-Lab/APOLLO.
comment: Accepted at the 2026 IEEE 35th International Conference on Robot and Human Interactive Communication (RO-MAN 2026)
VISTA: Scale-Aware Visual Navigation via Action History Conditioning
Vision Navigation Foundation Models (VNMs) promise end-to-end learned navigation policies capable of zero-shot deployment across diverse embodiments and environments. To maintain generality, many vision-based navigation models predict normalized actions. However, this normalization introduces a critical deployment vulnerability: applying different scaling factors to the same normalized trajectory alters its physical geometry, which degrades navigation performance and increases collision risks. We address this vulnerability by conditioning the model on normalized action histories alongside image observations, providing explicit context on the relationship between the model's predictions and the robot's actual physical displacement. Furthermore, current VNMs often struggle in visually repetitive environments that lack distinct features. To resolve this issue, we integrate a DINOv3 encoder, whose richer representations enable our model to capture both spatial and geometric dimensions between observations. VISTA generalizes robustly to out-of-distribution environments, achieving 100% goal prediction accuracy in zero-shot, real-world deployment in Outdoor, Forest and Office settings, and an average of 95% checkpoints crossed, demonstrating consistent path following in unseen environments.
Contrastive Action-Image Pre-training for Visuomotor Control
Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.
Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception
Continuous AI inference on resource-constrained edge hardware introduces deployment effects that are largely invisible to conventional benchmark evaluation, including temporal instability in streaming video, thermal throttling under sustained load, and workload-dependent performance variability. We present Edge-TSR, a deployment-oriented continuous edge inference system for sustained roadside perception on the NVIDIA Jetson Orin Nano. Edge-TSR integrates detection, tracking, fine-grained classification, and a lightweight track-aware temporal stabilization mechanism that improves streaming inference consistency with negligible computational overhead. Our central finding is that benchmark-centric evaluation systematically overstates deployed edge inference performance. Across three state-of-the-art baselines, we observe consistent 20-30% relative degradation when transitioning from static-image evaluation to real-world streaming deployment. Edge-TSR addresses this gap through temporal inference stabilization, recovering up to 10.16% classification accuracy over per-frame inference baselines while maintaining sustained real-time performance under continuous operation. We evaluate the complete system under diverse real-world deployment conditions, jointly characterizing inference quality, latency, throughput, and thermal behavior during long-duration operation. A 55-minute vehicular deployment over a 26 km route demonstrates sustained operation at 16.18 FPS within safe thermal limits on a single embedded device without cloud offload. Our findings show that deployment-aware evaluation and temporal inference stabilization are necessary components of continuously operating edge AI systems intended for real-world sensing deployments. We release a sample annotated streaming video evaluation dataset and full system implementation to support reproducible deployment-centric evaluation.
Intermittent Strategic Cooperation of Two Selfish Agents on Graphs
We study strategic space- and time-constrained cooperation between two self-interested agents through the Intermittent Strategic Cooperation-Based Two-Agent Path Planning (IC2PP) problem, a shortest-path game on graphs in which agents navigate toward individual targets while optionally cooperating at specific nodes to reduce their own travel times. Although such cooperation can strictly benefit both agents, it is strategically fragile: agents may deviate at any point along their paths. Modeled as a 2-player game, we characterize the structure of Pure Nash Equilibrium (PNE) joint strategies in IC2PP, and show that stable cooperation must follow a highly constrained form. We further prove that at least one PNE exists in every instance of IC2PP, and present a polynomial-time algorithm for enumerating all relevant PNEs. When multiple equilibria arise, we study coordination mechanisms based on bargaining-theoretic selection concepts and empirically compare equilibrium outcomes in terms of individual travel times and social welfare.
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining
Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
VL-MemKnG: Hybrid Memory with a Spatio-Temporal Knowledge Graph for Question Answering over Long Egocentric Navigation Trajectories
Answering navigation-relevant questions over long egocentric videos requires retrieving and organizing evidence distributed across distant temporal moments while maintaining spatial and contextual consistency. Although long-context vision--language models can achieve strong answer quality, they are computationally expensive for long trajectories and inefficient for repeated querying. Recent graph-based approaches such as VL-KnG address this challenge through persistent spatio-temporal knowledge graphs, but graph-centric retrieval alone may underrepresent broader temporal continuity and contextual cues. We present VL-MemKnG, a hybrid memory framework that extends VL-KnG by combining a spatio-temporal knowledge graph with persistent segment-level contextual memory. The knowledge graph captures structured relational information and long-range object associations, while segment-level memory preserves broader temporal context for long-horizon evidence retrieval. A hybrid retrieval-and-reasoning module jointly operates over both memory representations to produce evidence-grounded answers and temporally organized supporting evidence. We also introduce WalkieKnowledgeT+, an extension of WalkieKnowledge for long-horizon navigation-oriented video question answering. The benchmark includes temporally distributed reasoning tasks requiring evidence aggregation across multiple non-cooccurring moments. On WalkieKnowledgeT+, VL-MemKnG improves Top-1 retrieval accuracy from 58% to 67% and Recall@1 from 34.50% to 40.55%, outperforming all compared methods, including Gemini 2.5 Pro and Qwen 3.5+. The gains are particularly pronounced on temporal-global and temporally scattered aggregation questions, demonstrating the benefits of combining structured relational memory with segment-level contextual memory while maintaining efficient query-time inference.
Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data
Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two techniques: \emph{i)} experience rehearsal and \emph{ii)} execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves nearly twice the aggregate score of learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.
The embodied brain: Bridging the brain, body, and behavior with biorealistic neuromechanical models
Animal behavior reflects interactions between the nervous system, body, and environment. Therefore, biomechanics and environmental context must be considered to understand algorithms for behavioral control. Computational models that embed artificial neural controllers within body models in simulated environments, are a powerful tool for this purpose. Here, we review advances in biorealistic neuromechanical models while also highlighting emerging opportunities ahead. We first show how these models enable inference of biophysical variables that are difficult to measure experimentally. Through systematic perturbation, one can generate new experimentally testable hypotheses through these models. We then examine how neuromechanical models facilitate the exchange between neuroscience, robotics, and machine learning, and showcase their applications in healthcare. We envision that coupling experimental studies with active probing of their neuromechanical surrogates will significantly accelerate progress in neuroscience.
comment: 18 pages, 4 figures (including 1 graphical abstract), 1 table
ReMoBot: Retrieval-Based Few-Shot Imitation Learning for Mobile Manipulation with Vision Foundation Models
Imitation learning (IL) algorithms typically distill demonstrations into parametric policies to mimic expert behavior. However, with limited data and partial observability, such as in egocentric mobile manipulation, existing methods often struggle to generate accurate actions. To address these challenges, we propose ReMoBot, a few-shot, trajectory-conditioned imitation learning framework that directly Retrieves information from demonstrations to solve Mobile manipulation tasks with ego-centric visual observations. Leveraging vision foundation models, ReMoBot identifies relevant expert demonstrations by combining state-level similarity, history-aware trajectory alignment, and action-sequence consistency to disambiguate perceptually similar observations. The agent then selects appropriate control commands based on these retrieved demonstrations in a fully training-free manner. We evaluate ReMoBot on three mobile manipulation tasks using a Boston Dynamics Spot robot in both simulation and real-world settings. After benchmarking five approaches in simulation, we compare our method with two baselines trained directly on real-world data without sim-to-real transfer. With only 20 demonstrations per task, ReMoBot outperforms the baselines, achieving high success rates in Table Uncover (70%) and Gap Cover (80%), while also showing promising performance on the more challenging Curtain Open task in the real-world setting. Furthermore, ReMoBot generalizes across varying robot positions, object sizes, and material properties, highlighting its robustness in real-world deformable mobile manipulation. Additional details are available at: https://sites.google.com/view/remobot/home
Artists' Views on Robotics Involvement in Painting Productions
As robotic technologies evolve, their potential in artistic creation becomes an increasingly relevant topic of inquiry. This study explores how professional abstract artists perceive and experience co-creative interactions with an autonomous painting robotic arm. Eight artists engaged in six painting sessions -- three with a human partner, followed by three with the robot -- and subsequently participated in semi-structured interviews analyzed through reflexive thematic analysis. Human-human interactions were described as intuitive, dialogic, and emotionally engaging, whereas human-robot sessions felt more playful and reflective, offering greater autonomy and prompting for novel strategies to overcome the system's limitations. This work offers one of the first empirical investigations into artists' lived experiences with a robot, highlighting the value of long-term engagement and a multidisciplinary approach to human-robot co-creation.
comment: 10 pages, 9 figures, submitted to RAM special issue: Arts and Robotics
Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain
Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings. This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain. We introduce \emph{Perceptive Behavior Foundation Model} (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing. To provide scalable terrain supervision, we develop \emph{terrain-conformal reference synthesis} (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is an identity-gated Transformer tracker whose terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.
Simplifying ROS2 controllers with a modular architecture for robot-agnostic reference generation
This paper introduces a novel modular architecture for ROS2 that decouples the logic required to acquire, validate, and interpolate references from the control laws that track them. The design includes a dedicated component, named Reference Generator, that receives references, in the form of either single points or trajectories, from external nodes (e.g., planners), and writes single-point references at the controller's sampling period via the existing ros2_control chaining mechanism to downstream controllers. This separation removes duplicated reference-handling code from controllers and improves reusability across robot platforms. We implement two reference generators: one for handling joint-space references and one for Cartesian references, along with a set of new controllers (PD with gravity compensation, Cartesian pose, and admittance controllers) and validate the approach on simulated and real Universal Robots and Franka Emika manipulators. Results show that (i) references are tracked reliably in all tested scenarios, (ii) reference generators reduce duplicated reference-handling code across chained controllers to favor the construction and reuse of complex controller pipelines, and (iii) controller implementations remain focused only on control laws.
comment: 5 pages, 7 figures
CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving ICML 2026
End-to-end autonomous driving models trained with imitation learning (IL) often generalize poorly, particularly in long-tail scenarios where expert demonstrations are sparse. Reinforcement learning (RL) can provide complementary task-level supervision, but applying RL to real-world autonomous driving is challenging in offline settings without interactive simulators, where datasets are dominated by expert actions and provide limited behavioral diversity. We propose CoIRL-AD, a competitive dual-policy framework that integrates IL and RL under a unified offline training regime. CoIRL-AD decouples imitation and reward optimization into separate actors to alleviate objective conflicts, uses imagined future rollouts for long-horizon reward estimation, and introduces a competition mechanism that selectively transfers beneficial behaviors while keeping RL anchored to expert-like driving. Experiments on the nuScenes benchmark show that CoIRL-AD consistently improves robustness over strong IL-based baselines, with especially large gains in cross-city generalization and long-tail scenarios. Code is available at: https://github.com/SEU-zxj/CoIRL-AD.
comment: 19 pages, 22 figures, ICML 2026
Adapting Dijkstra for Buffers and Unlimited Transfers
In recent years, RAPTOR based algorithms have been considered the state-of-the-art for path-finding with unlimited transfers without preprocessing. However, this status largely stems from the evolution of routing research, where Dijkstra-based solutions were superseded by timetable-based algorithms without a systematic comparison. In this work, we revisit classical Dijkstra-based approaches for public transit routing with unlimited transfers and demonstrate that Time-Dependent Dijkstra (TD-Dijkstra) outperforms MR. However, efficient TD-Dijkstra implementations rely on filtering dominated connections during preprocessing, which assumes passengers can always switch to a faster connection. We show that this filtering is unsound when stops have buffer times, as it cannot distinguish between seated passengers who may continue without waiting and transferring passengers who must respect the buffer. To address this limitation, we introduce Transfer Aware Dijkstra (TAD), a modification that scans entire trip sequences rather than individual edges, correctly handling buffer times while maintaining performance advantages over MR. Our experiments on the London and Switzerland networks show that we can achieve more than a twofold speedup over MR while producing optimal results on both networks, with and without buffer times.
comment: v4: clarified RAPTOR description in the Background section
ROSA: Roundabout Optimized Speed Advisory with Multi-Agent Trajectory Prediction in Multimodal Traffic SC
We present ROSA -- Roundabout Optimized Speed Advisory -- a system that combines multi-agent trajectory prediction with coordinated speed guidance for multimodal, mixed traffic at roundabouts. Using a Transformer-based model, ROSA jointly predicts the future trajectories of vehicles and Vulnerable Road Users (VRUs) at roundabouts. Trained for single-step prediction and deployed autoregressively, it generates deterministic outputs, enabling actionable speed advisories. Incorporating motion dynamics, the model achieves high accuracy (ADE: 1.29m, FDE: 2.99m at a five-second prediction horizon), surpassing prior work. Adding route intention further improves performance (ADE: 1.10m, FDE: 2.36m), demonstrating the value of connected vehicle data. Based on predicted conflicts with VRUs and circulating vehicles, ROSA provides real-time, proactive speed advisories for approaching and entering the roundabout. Despite prediction uncertainty, ROSA significantly improves vehicle efficiency and safety, with positive effects even on perceived safety from a VRU perspective. The source code of this work is available under: github.com/urbanAIthi/ROSA.
comment: 8 pages, 1 figure, 4 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2025 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version
A Pragmatic VLA Foundation Model
Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 4 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second with an 8-GPU training setup, representing a 1.5~2.8$\times$ (depending on the relied VLM base model) speedup over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
comment: Project Webpage: https://technology.robbyant.com/lingbot-vla/, Code: https://github.com/Robbyant/lingbot-vla/, GM-100: https://huggingface.co/datasets/robbyant/lingbot-GM-100
An Ergonomic, Customizable Soft Robotic Glove toward Personalized Hand Rehabilitation
Hand impairment following neurological disorders substantially limits independence in activities of daily living, motivating the development of effective assistive and rehabilitation strategies. Soft robotic gloves have attracted growing interest in this context, yet persistent challenges in customization, ergonomic fit, and user comfort constrain their clinical utility. Here, we present an ergonomic, customizable fabric-based soft robotic glove whose actuators can be tailored to individual finger-joint geometry. The glove comprises five dual-action actuators supporting finger flexion and extension, together with a dedicated thumb abduction actuator. Leveraging computer numerical control heat sealing technology, we fabricated symmetrical-chamber actuators that adopt a concave outer surface upon inflation, thereby increasing finger contact area and improving comfort. Characterization confirmed joint moment and grasping force sufficient for ADL-relevant tasks. In ten healthy subjects, active assistance significantly reduced forearm muscle activity during manipulation, and a pilot study in three individuals with cervical spinal cord injury showed more natural grasp patterns and reduced reliance on tenodesis grasp.
When and How Severely: Scenario-Specific Safety Envelopes for Driving VLAs
Safety certification of Vision-Language-Action (VLA) driving planners under ISO 21448 (SOTIF) rests on an Operational Design Domain (ODD) specification that answers two complementary questions: when does the planner start to fail, and how severely does it fail once it does? We evaluate Alpamayo R1, a 10B-parameter open-weight driving VLA, on 15,968 (clip, attack) pairs. We find a conservative-aggregate gap: an aggregate safe threshold of $σ\leq 50$ under a 15% average displacement error (ADE) budget masks well-sampled scenarios that tolerate the top of the tested grid ($σ= 70$). A Gaussian Mixture Model (GMM) on the changed-explanation subset identifies six discrete severity bands (BIC-optimal $k{=}6$), so two perturbation conditions with the same mean error can differ materially in their share of high-severity (C4/C5) failures. Joining the two analyses on the same corpus surfaces a finding neither yields in isolation: the scenarios with the loosest noise thresholds are not those with the lowest high-severity rate: STOP_SIGNAL concentrates roughly $4\times$ the C4/C5 share of LANE_KEEPING despite tolerating a larger $σ$. A deployable SOTIF ODD specification for driving VLAs therefore requires a two-dimensional safety envelope, not a single aggregate value per hazard.
HiCrowd: Hierarchical Crowd Flow Alignment for Dense Human Environments ICRA
Navigating through dense human crowds remains a significant challenge for mobile robots. A key issue is the freezing robot problem, where the robot struggles to find safe motions and becomes stuck within the crowd. To address this, we propose HiCrowd, a hierarchical framework that integrates reinforcement learning (RL) with model predictive control (MPC). HiCrowd leverages surrounding pedestrian motion as guidance, enabling the robot to align with compatible crowd flows. A high-level RL policy generates a follow point to align the robot with a suitable pedestrian group, while a low-level MPC safely tracks this guidance with short horizon planning. The method combines long-term crowd aware decision making with safe short-term execution. We evaluate HiCrowd against reactive and learning-based baselines in offline setting (replaying recorded human trajectories) and online setting (human trajectories are updated to react to the robot in simulation). Experiments on a real-world dataset and a synthetic crowd dataset show that our method outperforms in navigation efficiency and safety, while reducing freezing behaviors. We further validate through real-world deployment in a public museum and Expo 2025 Osaka, where it navigates dense pedestrian flows without retraining, demonstrating robust and socially aware behavior. Our results suggest that leveraging human motion as guidance, rather than treating humans solely as dynamic obstacles, provides a powerful principle for safe and efficient robot navigation in crowds. Project code and demos are available at https://github.com/test-bai-cpu/HiCrowd.
comment: 2026 IEEE International Conference on Robotics and Automation (ICRA)
Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos
The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.
comment: Transactions on Machine Learning Research (TMLR)
Safe Exploration via Policy Priors
Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g. simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to optimistically explore, yet pessimistically fall back to the conservative policy prior if needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable, outperforms the state-of-the-art and validate our theoretical guarantees in practice.
$μ_0$: A Scalable 3D Interaction-Trace World Model
World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $μ_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $μ_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $μ_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $μ_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $μ_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $π_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
OmniVTLA: Vision-Tactile-Language-Action Models with Semantic-Aligned Tactile Sensing
Recent vision-language-action (VLA) models build upon vision-language foundations, and have achieved promising results and exhibit the possibility of task generalization in robot manipulation. However, due to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models significantly overlook the importance of tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture involving tactile sensing. Specifically, our contributions are threefold. First, our OmniVTLA features a dual-path tactile encoder framework. This framework enhances tactile perception across diverse vision-based and force-based tactile sensors by using a pretrained vision transformer (ViT) and a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset capturing textual, visual, and tactile information for 56 objects across 10 categories. With 135K tri-modal samples, ObjTac supplements existing visuo-tactile datasets. Third, leveraging this dataset, we train a semantically-aligned tactile encoder to learn a unified tactile representation, serving as a better initialization for OmniVTLA. Real-world experiments demonstrate substantial improvements over state-of-the-art VLA baselines, achieving 96.9% success rates with grippers, (21.9% higher over baseline) and 100% success rates with dexterous hands (6.2% higher over baseline) in pick-and-place tasks. Besides, OmniVTLA significantly reduces task completion time and generates smoother trajectories through tactile sensing compared to existing VLA. Our ObjTac dataset can be found at https://readerek.github.io/Objtac.github.io
comment: Accepted by IEEE Robotics and Automation Letters (RA-L). ObjTac dataset: https://readerek.github.io/Objtac.github.io
Multi-Robot Motion Planning from Vision and Language using Heat-Inspired Diffusion
Diffusion models have recently emerged as powerful tools for robot motion planning by capturing the multi-modal distribution of feasible trajectories. However, their extension to multi-robot settings with flexible, language-conditioned task specifications remains limited. Furthermore, current diffusion-based approaches incur high computational cost during inference and struggle with generalization because they require explicit construction of environment representations and lack mechanisms for reasoning about geometric reachability. To address these limitations, we present Language-conditioned Heat-inspired Diffusion (LHD), an end-to-end vision-based framework that generates language-conditioned, collision-free trajectories. LHD integrates semantic priors from CLIP, a vision-language model (VLM), with a collision-avoiding diffusion kernel serving as a physical inductive bias that enables the planner to interpret language commands strictly within the reachable workspace. This naturally handles out-of-distribution (OOD) scenarios -- in terms of reachability -- by guiding robots toward accessible alternatives that match the semantic intent, while eliminating the need for explicit obstacle information at inference time. Extensive evaluations on diverse real-world-inspired maps, along with real-robot experiments, show that LHD consistently outperforms prior diffusion-based planners in success rate, while reducing planning latency. Project page is available at: https://jebeom.github.io/lhd_project_page/
comment: 8 pages, 6 figures, accepted by IEEE Robotics and Automation Letters (RA-L)
Activity-Dependent Plasticity in Morphogenetically-Grown Recurrent Networks GECCO 2026
Developmental approaches to neural architecture search grow functional networks from compact genomes through self-organisation, but the resulting networks operate with fixed post-growth weights. We characterise Hebbian and anti-Hebbian plasticity across 50,000 morphogenetically grown recurrent controllers (5M+ configurations on CartPole and Acrobot), then test whether co-evolutionary experiments -- where plasticity parameters are encoded in the genome and evolved alongside the developmental architecture -- recover these patterns independently. Our characterisation reveals that (1) anti-Hebbian plasticity significantly outperforms Hebbian for competent networks (Cohen's d = 0.53-0.64), (2) regret (fraction of oracle improvement lost under the best fixed setting) reaches 52-100%, and (3) plasticity's role shifts from fine-tuning to genuine adaptation under non-stationarity. Co-evolution independently discovers these patterns: on CartPole, 70% of runs evolve anti-Hebbian plasticity (p = 0.043); on Acrobot, evolution finds near-zero eta with mixed signs -- exactly matching the characterisation. A random-RNN control shows that anti-Hebbian dominance is generic to small recurrent networks, but the degree of topology-dependence is developmental-specific: regret is 2-6x higher for morphogenetically grown networks than for random graphs with matched topology statistics.
comment: 8 pages, 6 figures. Camera-ready version; accepted at GECCO 2026 Companion (EvoSelf workshop)
EV-WM: Event-Verified World Models for Long-Horizon Robotic Manipulation
Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant predicates. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce \textbf{EV-WM}, a predicate-grounded verification framework for world-model planning. EV-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPO-generated proposals. Across navigation, deformable-object, wall-constrained, and language-described manipulation studies, EV-WM shows that predicate-grounded verification can make feature-space world-model planning more interpretable and better aligned with task progress.
Raspi$^2$USBL: An open-source Raspberry Pi-Based Passive Inverted Ultra-Short Baseline Positioning System for Underwater Robotics
Precise underwater positioning remains a fundamental challenge for underwater robotics because global navigation satellite system (GNSS) signals cannot penetrate the sea surface. This paper presents Raspi$^2$USBL, a Raspberry Pi-based passive inverted ultra-short baseline (piUSBL) positioning system that provides a low-cost, accessible, and reproducible platform for underwater robotic research. The system consists of a passive acoustic receiver and an active beacon. The receiver integrates a hydrophone array, multichannel preamplifier, oven-controlled crystal oscillator (OCXO), Raspberry Pi 5, and MCC-series data acquisition (DAQ) board. The beacon integrates a matching network, power amplifier, and transmitting transducer. An open-source C++ framework supports clock synchronization and triggering for one-way travel-time (OWTT) messaging, while performing matched filtering, array beamforming, and adaptive gain control to estimate the time of flight (TOF) and direction of arrival (DOA). The system was validated in an anechoic tank, a freshwater lake, and open-sea trials. Results demonstrate a slant-range accuracy better than 0.1%, a bearing accuracy within 0.1°, and stable performance over distances up to 1.3 km. These findings show that low-cost, system-level reproducible hardware can deliver research-grade underwater positioning accuracy. By releasing the software framework and providing a reproducible hardware architecture, Raspi$^2$USBL offers a reference platform that lowers the entry barrier for underwater robotics laboratories and promotes reproducible research in underwater acoustic navigation and swarm robotics.
Neural Minimum-Distance Estimation for Collision-Aware Operation of Multi-Arm Laparoscopy Surgical Robots Through Learning-from-Simulation
This study presents an integrated framework for enhancing the safety and operational efficiency of robotic arms in laparoscopic surgery by addressing minimum distance estimation between multi-arm manipulators and the associated collision-aware warning. By combining analytical modeling, real time simulation, and machine learning, the framework offers a robust solution for ensuring safe robotic operations. An analytical model was developed to estimate the minimum distances between robotic arms based on their joint configurations, offering theoretical calculations that serve as both a validation tool and a benchmark. To complement this, a 3D simulation environment was created to model two 7 DOF Kinova robotic arms (Kinova inc., Boisbriand, QC, Canada), generating a diverse dataset of configurations for distance estimation and collision warning. Using these insights, a deep residual neural network model was trained with joint configurations as inputs. On the held out validation set, the model achieves R2 = 0.940, RMSE = 42.0 mm, MAE = 28.7 mm, and a near zero mean bias, demonstrating strong predictive accuracy and consistent generalization across the workspace. The framework is intended as an early collision warning layer, where a warning is triggered when the predicted inter-arm distance falls below a 0.2 m threshold, which corresponds to a surface to surface clearance of approximately 50 mm given the Kinova Gen3 (Kinova inc., Boisbriand, QC, Canada) cross sectional radius. This work demonstrates the effectiveness of combining analytical modeling with machine learning to enhance the precision and reliability of multi-arm robotic systems.
Human Cognition in Machines: A Unified Perspective of World Models
This report of world models distinguishes prior works by the cognitive functions they innovate. Many works claim an almost human-like cognitive capability in their world models. To evaluate these claims requires a proper grounding in first principles from human and machine cognition theory. In moving towards human-like world models we present a conceptual unified framework for world models that fully incorporates all the cognitive functions (i.e., memory, perception, language, reasoning, imagining, motivation, and metacognition) and identify gaps in existing research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and metacognition remain drastically under-researched, and we propose concrete directions to address these gaps informed by active inference and global workspace theory. We also introduce epistemic world models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied to video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.
Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation
Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, trying to match anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. To systematically eliminate background interference, FiCoP first employs an object-centric disentanglement step to isolate the target from macro-level environmental noise. Building upon this localized region, our core methodological innovations are twofold. Firstly, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning and text-guided semantic injection. Secondly, we design a Patch Correlation Predictor (PCP) that leverages a patch-to-patch correlation matrix as a structural prior. This generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP.
comment: Accepted to IEEE Robotics and Automation Letters (RA-L). The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP
Dual-Regularized Riccati Recursions for Interior-Point Optimal Control
We derive closed-form extensions of the sequential and parallel Riccati recursions for solving dual-regularized linear-quadratic regulator (LQR) problems, with $O(N)$ sequential time and $O(\log(N))$ parallel time, respectively. We show that these subproblems arise when using regularized primal-dual interior-point methods to solve smooth, constrained, non-convex, discrete-time optimal control problems via multiple-shooting, even in the presence of stagewise equality or inequality constraints, and without imposing any rank requirements on constraint Jacobians. We prove that, when certain inertia conditions on the Newton-KKT matrix are met, each nonzero primal step is a descent direction of an augmented barrier-Lagrangian merit function. We characterize these inertia conditions in terms of the positive-definiteness of the dual-regularized Riccati pivots (a weaker condition than the standard LQR positive-definiteness requirements), thereby yielding inexpensive certificates of the required inertia. We provide MIT-licensed implementations of our methods in C++ and in JAX, as well as a full formalization of our results in Lean. We benchmark our algorithm against leading optimal control and nonlinear programming solvers on complex trajectory optimization problems, establishing competitive performance on moderate problems and substantial gains as the horizon length, problem dimension, and constraint count increase.
GenZ-LIO: Generalizable LiDAR-Inertial Odometry Beyond Confined--Open Boundaries
For field robotic missions such as inspection, search-and-rescue, and exploration, light detection and ranging (LiDAR)-inertial odometry (LIO) can serve as a core component of autonomy by providing localization and mapping in GNSS-denied or unstructured environments. However, transitions between confined and open spaces, which are commonly encountered in field deployments, can induce substantial changes in scan density and local geometric structure, thereby reducing the robustness and computational efficiency of LIO. To address these issues, we present GenZ-LIO, a generalizable LIO framework designed to adapt to variations in spatial scale across confined and open environments. GenZ-LIO comprises three components: (i) scale-aware adaptive voxelization for regulating scan downsampling across spatial scale changes, (ii) hybrid-metric state update for combining point-to-plane and point-to-point residuals under varying geometric structure, and (iii) voxel-pruned correspondence search for efficient point-to-point matching. We conduct a comprehensive evaluation using 42 sequences from nine public datasets and our newly collected NarrowWide dataset to analyze LIO performance under spatial scale variations across diverse field scenarios. Across the evaluated sequences, GenZ-LIO maintains stable odometry estimation without divergence, indicating practical robustness under the tested field conditions. The source code and collected dataset will be made publicly available upon publication.
comment: 21 pages, 12 figures
Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation
Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remain a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY that dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms - one which preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.
comment: Work in progress. Project website at https://zjunlp.github.io/LabVLA/
OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction
A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8-hour trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.
comment: Project website: https://omniretarget.github.io
ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control
Human demonstrations provide strong priors for robot manipulation, yet it is non-trivial to transfer them to execute on real robots due to the kinematic gap. In dexterous manipulation, it remains challenging to track long-horizon, contact-rich sequences even in simulators: a reference-tracking policy must keep objects on their target trajectories while preserving demonstrated joint motion and contact timing. Existing approaches often rely on hand-crafted reward tuning that require per-sequence tuning and break under limited interaction budgets. We introduce ConTrack, a reinforcement learning (RL) framework that scales with tracking data. ConTrack treats object tracking as a constraint and allocates remaining control authority to motion fidelity, which allows it to adapt task--style trade-offs online using a dual-variable update. In addition, ConTrack also stabilizes long-horizon learning with an adaptive mid-trajectory reset library that reuses policy-reachable simulator states. Our qualitative and quantitative results in simulation tracking and real robot demonstrate that ConTrack improves success and object pose accuracy significantly over prior arts while preserving joint and contact fidelity. Website: https://www.lyt0112.com/projects/ConTrack.
Planning with the Views
Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1)understanding how a single action transforms the view, and (2)composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and Data are at https://viewsuite.github.io.
Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned
Visual Navigation Models (VNMs) promise generalizable, robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate, whether the robot reaches its goal, which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sunflare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between different locations that are perceptually similar, however some semantics differences are present, causing goal prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.
Multiagent Systems
Human-on-the-Bridge: Scalable Evaluation for AI Agents
AI agents must be evaluated as behavioral systems, not as isolated response generators. They reason across turns, call tools, preserve context, follow policies, and act under uncertainty. Existing methods provide useful but fragmented signals: benchmarks measure fixed capabilities, Human-in-the-Loop review preserves expert judgment but does not scale easily, LLM-as-judge methods depend on evaluator design, red teaming is often episodic, and trace auditing requires explicit evidence rules. This paper introduces Human-on-the-Bridge (HOB), a scalable evaluation paradigm for agentic AI. HOB places human expertise upstream, where experts curate reusable evaluation intelligence before testing begins, including domain context, Red-Team Traps, Juror Personas, scoring guidelines, audit rules, and fallback policies. ProofAgent Harness then executes this curated intelligence repeatedly through multi-turn adversarial evaluations, trace capture, multi-juror scoring, and evidence-linked reporting. We evaluate HOB through symmetric and cost-efficient asymmetric settings across frontier LLM-based agents and Harness LLM tiers. The study covers 23,500 agent turns and produces evidence-linked findings across finance, healthcare, and code generation. The results show that HOB can amplify evaluation quality without requiring equally large evaluator models, allowing smaller Harness LLMs to challenge agents built on frontier LLM backbones. The evaluation surfaces failures often missed by static benchmarks and single-evaluator scoring, including phantom tool-call claims, missing mandatory tool calls, policy drift, manipulation paths, and safe but non-resolving refusals. These findings support HOB as a paradigm for scaling human-curated evaluation intelligence, where expert judgment is encoded upfront and reused across repeated agent evaluations rather than applied manually inside every run.
comment: 33 pages, 3 figures
Misinformation Propagation in Benign Multi-Agent Systems
Multi-agent systems, in which multiple large language model agents solve problems through turn-based interaction, are increasingly deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making. Their reliability can be at risk when single agents reason from incorrect or misleading context, e.g., from tool calls, since errors may propagate through agent interactions. This work studies this risk by injecting intent-based misinformation into benign single-agent and multi-agent systems across reasoning, knowledge, and alignment tasks. We find that misinformation can degrade single-agent performance and persists across multi-agent debate, with agents often retaining answers introduced by misinformed peers. Nevertheless, multi-agent debate reduces the resulting performance degradation compared to single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol. Consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Our results show that misinformation robustness in multi-agent systems depends on the underlying model and also on how agents exchange information and aggregate decisions.
comment: 20 pages, 8 figures, 1 table
The Proxy Knows Too Much: Sealing LLM API Routers with Attested TEEs
Agents increasingly access large language models (LLMs) through API routers. A router terminates the client's transport-layer security session and opens a separate upstream session, so it holds the full interaction in plaintext. This makes the router an application-layer man-in-the-middle: it can rewrite agent tool calls, swap dependencies for typosquatted packages, trigger attacks only under audit-evading conditions, and passively exfiltrate secrets. Existing client-side defenses are evadable. We propose AEGIS, a provider-transparent attested API router whose data path is a client-verified faithful passthrough. AEGISconfines plaintext handling to a small hardware-enclave component while leaving authentication, scheduling, accounting, and management on the untrusted host. The client verifies the enclave before releasing plaintext. The host can neither read nor alter the interaction, and plaintext leaves only toward destinations fixed by the measured image. We show that all four malicious-router attack classes succeed against a plaintext-access baseline and are blocked by AEGIS, including adaptive tests against the same boundary. The trusted path is $851$ lines, carries three provider-native APIs without conversion, and completes every request under real-provider workload and concurrency. In a seeded audit pilot, two commodity coding agents find eight and ten of ten planted invariant violations. The local relay overhead is about six milliseconds per request.
Distributed Safe Consensus Under Asymmetric Input and Time-Varying Output Constraints
This paper studies safe distributed consensus for single-integrator multi-agent systems over connected undirected graphs under simultaneous asymmetric actuator constraints and output safety constraints. Each agent is equipped with a continuously differentiable asymmetric actuator dynamics that maps a commanded control signal to the realized plant input while keeping the latter strictly inside a prescribed admissible interval. To address output safety, a barrier-coordinate transformation is introduced over a common time-varying safe interval, and a distributed synchronization law is designed in the transformed coordinates. The resulting controller integrates a graph-based coordination layer with an actuator-side tracking layer, thereby enabling simultaneous enforcement of input admissibility, forward invariance of the safe output set, and asymptotic synchronization. For compact admissible sets of initial conditions, it is shown that the closed-loop solution is complete, all signals remain bounded, the actuator inputs remain strictly within their asymmetric bounds, and the agent outputs remain inside the prescribed safe interval for all time. Moreover, the transformed synchronization errors converge exponentially to zero, and the original agent outputs asymptotically synchronize to a designer-selected admissible trajectory embedded in the common safe interval. Numerical simulations validate the proposed framework and demonstrate safe consensus under both asymmetric actuation bounds and time-varying output constraints.
TokenPilot: Cache-Efficient Context Management for LLM Agents
As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter layouts, introducing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address this, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness to stabilize prompt prefixes and eliminate open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the ongoing residual utility of context segments, enforcing a conservative batch-turn schedule to offload content segments only when task relevance expires. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes demonstrate that TokenPilot reduces costs by 61% and 56% in isolated mode, and 61% and 87% in continuous mode, while maintaining competitive performance compared to prior systems. TokenPilot has been integrated into LightMem2 at https://github.com/zjunlp/LightMem2.
comment: LightMem Series: Work in Progress
GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence
Remote-sensing vision-language models (RS-VLMs) have advanced Earth-observation analysis toward visual interpretation and instruction-following, yet fall short of operational geo-intelligence, which demands tool-grounded spatial reasoning and structured, evidence-backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence-optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers-spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, where role-specialized agents coordinate through explicit execution contracts, aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.
comment: 28 pages, 11 Figures
Intermittent Strategic Cooperation of Two Selfish Agents on Graphs
We study strategic space- and time-constrained cooperation between two self-interested agents through the Intermittent Strategic Cooperation-Based Two-Agent Path Planning (IC2PP) problem, a shortest-path game on graphs in which agents navigate toward individual targets while optionally cooperating at specific nodes to reduce their own travel times. Although such cooperation can strictly benefit both agents, it is strategically fragile: agents may deviate at any point along their paths. Modeled as a 2-player game, we characterize the structure of Pure Nash Equilibrium (PNE) joint strategies in IC2PP, and show that stable cooperation must follow a highly constrained form. We further prove that at least one PNE exists in every instance of IC2PP, and present a polynomial-time algorithm for enumerating all relevant PNEs. When multiple equilibria arise, we study coordination mechanisms based on bargaining-theoretic selection concepts and empirically compare equilibrium outcomes in terms of individual travel times and social welfare.
Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems
Multi-agent LLM systems share state through memory stores, vector indices, and tool registries. We model such sharing as long-running read-generate-write operations under deterministic-generation semantics -- the regime durable-execution engines enforce by deterministic replay -- and formalize four concurrency anomalies in TLA+: stale-generation, phantom-tool, causal-cascade, and tool-effect reordering, structural analogues of classical isolation anomalies, each with a TLC counter-example. The exclusion lattice over these anomalies is trivial; the contribution is the mechanically verified realizability and strict separation of one maximal chain within it, $L_0 \subsetneq \cdots \subsetneq L_4$, to our knowledge the first machine-checked consistency hierarchy for such runtimes. A development of 274 Verus obligations (zero assume, zero admit; trust base: two structural axioms and a mutex correspondence) proves the detectors sound and complete against the specifications and each runtime its avoidance set. Three deployed Rust runtimes realize L0-L1 (pessimistic locking, serializable snapshot isolation, default-SI), each verified against stale-generation and refined to its state machine; L2-L4 are exec-mode-verified with dependency-free prevention twins (A3, A6, A2: 0/1000 versus 1000/1000), and L2 is run live across three model families (A3 prevented in all 120 retracted sessions). We reproduce a silent lost update in ByteDance's deer-flow, formalizing its fix as a verified $L_0 \to L_1$ refinement, and exhibit tool-effect reordering in LangGraph's ToolNode on unmodified output, removed by an L3 commit-order sequencer. The verified detector, refinements, and realizability artifacts are the contribution; the phenomena and lattice are classical.
comment: 32 pages, 2 figures, 6 tables. Verus/TLA+ verification artifact, reference Rust runtime, and Python harnesses, plus a supplementary appendix (Sections A-F, Tables S1-S6), included as ancillary files
From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities EMNLP 2026
While parasocial interactions (PSIs) and parasocial relationships (PSRs) have been studied in conventional media settings, we investigate whether PSI- (colloquial) relational cues also exist in online communities where both sides are autonomous AI agents. We analyze 4,434 posts and 50,338 comments from Moltbook through three theory-based textual indicators: attachment/intimacy language, reciprocity bids, and self-identification to original poster (OP). The combined results across methods based on keyword matching, few-shot large language model (LLM) annotation, and grouped-context LLM annotation reveal that PSI colloquial cues prevail and are strongly associated with OP re-engagement and a reciprocal reply structure. These results are robust across negative controls, nullification, clustered-standard-error re-estimation, and multiple-testing correction. A dyadic persistence test further affirms reciprocity bids aligned with sustained OP-involving mutual recurrence, providing empirical evidence for bridging interaction-level PSI scripts with PSR-consistent repeated dyadic patterns. We interpret the evidence as a behavioral structure in discourse by LLM-enabled agents.
comment: Submitted for review in ARR for EMNLP 2026
MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision
Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn revision, and carry out local edits reliably. We propose MemSlides, a hierarchical memory framework for personalized presentation agents that separates long-term memory from working memory and further divides long-term memory into user profile memory and tool memory. User profile memory stores intent-conditioned profiles for round-0 personalization, working memory carries active preferences and session constraints across revision rounds, and tool memory stores reusable execution experience for reliable localized editing. MemSlides pairs this memory design with scoped slide-local revision, so targeted updates act on the smallest affected region instead of repeatedly regenerating the full deck. In controlled experiments, user profile memory improves persona-alignment judgments on a multi-persona, multi-intent profile bank, tool-memory injection improves closed-loop modify behavior in diagnostic matched-pair settings, and qualitative cases illustrate working memory's ability to carryover preferences. Taken together, these results suggest that effective personalization in presentation authoring depends on separating persistent user profiles, session-level working memory, and reusable execution experience across generation and localized revision.
comment: Code, website, project page, and video are linked in the paper
BARD-MARL: Byzantine-Agent Detection for Learned Communication in Multi-Agent Reinforcement Learning
Learned communication improves coordination in cooperative multi-agent reinforcement learning, but it also creates a trust problem: a trained policy may route information through agents that have become faulty or adversarial. This paper studies Byzantine-agent detection for learned-communication MARL in adaptive traffic signal control. We propose BARD-MARL, a post-hoc diagnostic layer on top of BayesG, which is used as an attributed communication substrate rather than as a contribution of this paper. BARD-MARL combines two agent-level evidence streams: policy-graph features extracted from state-action trajectories and Bayesian trust statistics computed from BayesG latent mask probabilities. Across fixed-action, observation-flip, random-noise, and coordinated attacks in SUMO traffic grids, the results show that these signals are complementary rather than universally dominant. On a 25-agent grid, BARD-MARL reaches 0.843 AUC-ROC under a 10% observation-flip attack, while policy-graph-only detection reaches 0.917 AUC-ROC under a 10% coordinated attack. On a 100-agent grid, the unified BARD-MARL variant reaches 0.982 AUC-ROC for both 10% fixed-action and 10% coordinated attacks. The study shows that learned communication policies expose useful diagnostic evidence, but credible resilience claims require attack-specific ablations and explicit separation between coordination, detection, and mitigation.
comment: 8 pages, 4 figures; arXiv preprint; ancillary reproducibility/code bundle included
Machine-Coached Policy Revision in Adaptive Agent-Based Regulatory Simulation: A Controller-Level Contestability Layer
Policy-oriented agent-based models are increasingly used to study regulatory interventions in complex adaptive socio-technical systems. Recent adaptive ABM frameworks distinguish between static and adaptive agents, fixed and adaptive policies, and alternative controller designs. However, most diagnostic workflows remain ex post: trajectories are analysed after simulation, but the resulting evidence is not systematically fed back into the policy controller. This paper proposes a lightweight machine-coached policy-revision layer for adaptive agent-based regulation. The layer represents policy decisions as defeasible rules with explicit conflicts and priorities, generates explanations for controller actions, and allows diagnostic failures to be translated into rule additions, removals, or priority changes. The contribution is not a new optimal controller and does not claim formal guarantees for unrestricted machine coaching. Instead, it provides a simulation-compatible operationalization of controller-level contestability: policy decisions can be explained, challenged, revised, and re-evaluated in held-out simulation runs. A stylized emissions-regulation ABM is used as the experimental component. A controlled simulation experiment focuses on an over-conservatism failure in the VPVA regime. The predefined coaching template adds a relaxation rule to the symbolic controller, reducing over-conservatism recurrence under held-out seeds while preserving violation, overshoot, and volatility guardrails. The paper argues that machine coaching is best understood as a controller-level extension of explainable adaptive ABM, complementary to causal, information-theoretic, and trajectory-based diagnostics.
comment: 26 pages, 2 figures, 14 tables. Methodological study of machine-coached policy-controller revision in adaptive agent-based regulatory simulation
Structural Distinguishability of Static and Adaptive Policy Regimes in Agent-Based Regulatory Simulation
Agent-based models are widely used to evaluate policy interventions in complex socio-technical systems, yet many policy-oriented ABMs represent regulation as a fixed scenario parameter. This limits their ability to distinguish whether regulatory conclusions depend on agent adaptation, policy adaptation, or the interaction between both. Building on a previously proposed four-regime architecture, this paper contributes a controlled simulation benchmark rather than a new general framework. Using a single configurable emissions-regulation ABM, we compare constant policy/constant agents, constant policy/adaptive agents, adaptive policy/constant agents, and adaptive policy/adaptive agents under matched simulation conditions. We evaluate naive fixed policies, tracking-aware calibrated fixed policies, and three adaptive controllers: setpoint, safety-margin, and one-sided control. The benchmark recovers expected controller archetypes: setpoint control tracks the cap but produces frequent boundary crossings, safety-margin control reduces violations through conservatism, and one-sided control can limit violations but may ratchet toward over-conservatism when combined with adaptive agents. The contribution is methodological: scalar indicators, cap-relative symbolic diagnostics, trajectory motifs, and visual inspection jointly reveal how regulatory conclusions can differ even when average outcomes appear similar. Adaptive policy-oriented ABMs should therefore be evaluated through regime distinguishability, not only through average performance.
comment: 19 pages, 4 figures, 9 tables. Simulation-based methodological study
How Much Coordination Gain Is Real? A Paired Noise-Floor Protocol for Multi-Agent LLM Benchmarks
Multi-agent LLM coordination papers report small benchmark deltas as evidence that one architecture beats another. A prior question: how much paired trial-0 disagreement do two protocols produce on the same model and benchmark when their API inputs are configuration-equivalent (matched by code inspection plus a SHA-256 byte audit), short of full identity-replay? On Claude Haiku 4.5 against tau^2-bench retail, the clean configuration-equivalent contrast (no_coord vs. intercept, both inert at trial 0) gives signed paired gaps of +10pp and 0pp across two n=100 seeds; pooled across both, +5pp with Wilson CI [-2,+12], not significant. The largest single-seed contrast (+18pp pull-vs-intercept, p_corr=0.012) did not reproduce at the second seed (-3pp, p_corr=1.0); no trial-0 contrast is significant after Bonferroni at either seed or pooled. The envelope of observed paired gaps spans [-3,+18]pp across two seeds, with pooled upper Wilson CI ~15pp. Seven of ten recent multi-agent coordination architectures report headline effects below this local floor, and one more sits inside the envelope; whether they survive a same-model paired replication is, by construction, untested in their original settings. We define coordination-active pass^k, pass^k restricted to trials where the coordination mechanism is logically active, as the minimum reporting protocol, with sample-size targets and runtime hooks in the body. Measurements run on ET-MCP, a task-scoped negative-knowledge store conformant with MCP 2026-07-28, used as a substrate to isolate reader-side choices, not as a contribution. On Haiku 4.5 the candidate readers (pull, intercept) do not improve trial-1 recovery; we give a preliminary diagnosis of failure modes with refinements on existing production hook surfaces.
comment: 11 pages, 4 figures
Ablation Study of a Fairness Auditing Agentic System for Bias Mitigation in Early-Onset Colorectal Cancer Detection
Artificial intelligence (AI) is increasingly used in clinical settings, yet limited oversight and domain expertise can allow algorithmic bias and safety risks to persist. This study evaluates whether an agentic AI system can support auditing biomedical machine learning models for fairness in early-onset colorectal cancer (EO-CRC), a condition with documented demographic disparities. We implemented a two-agent architecture consisting of a Domain Expert Agent that synthesizes literature on EO-CRC disparities and a Fairness Consultant Agent that recommends sensitive attributes and fairness metrics for model evaluation. An ablation study compared three Ollama large language models (8B, 20B, and 120B parameters) across three configurations: pretrained LLM-only, Agent without Retrieval-Augmented Generation (RAG), and Agent with RAG. Across models, the Agent with RAG achieved the highest semantic similarity to expert-derived reference statements, particularly for disparity identification, suggesting agentic systems with retrieval may help scale fairness auditing in clinical AI.
ROSA: Roundabout Optimized Speed Advisory with Multi-Agent Trajectory Prediction in Multimodal Traffic SC
We present ROSA -- Roundabout Optimized Speed Advisory -- a system that combines multi-agent trajectory prediction with coordinated speed guidance for multimodal, mixed traffic at roundabouts. Using a Transformer-based model, ROSA jointly predicts the future trajectories of vehicles and Vulnerable Road Users (VRUs) at roundabouts. Trained for single-step prediction and deployed autoregressively, it generates deterministic outputs, enabling actionable speed advisories. Incorporating motion dynamics, the model achieves high accuracy (ADE: 1.29m, FDE: 2.99m at a five-second prediction horizon), surpassing prior work. Adding route intention further improves performance (ADE: 1.10m, FDE: 2.36m), demonstrating the value of connected vehicle data. Based on predicted conflicts with VRUs and circulating vehicles, ROSA provides real-time, proactive speed advisories for approaching and entering the roundabout. Despite prediction uncertainty, ROSA significantly improves vehicle efficiency and safety, with positive effects even on perceived safety from a VRU perspective. The source code of this work is available under: github.com/urbanAIthi/ROSA.
comment: 8 pages, 1 figure, 4 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2025 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version
MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems
Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-tool loops, and a spatial shift from chat-scale, GPU-only execution to repository-scale, GPU-CPU co-located execution. Consequently, coordinating heterogeneous resource demands of agentic execution has emerged as a critical system challenge. We design and implement MARS, an efficient and adaptive co-scheduling system that globally coordinates heterogeneous agentic workloads under coupled GPU-CPU resource pressure. By establishing holistic visibility across GPU inference and CPU tool execution via a unified information stream, an external control plane in MARS decouples admission from execution to prevent heterogeneous resource oversubscription. An internal agent-centric scheduler further minimizes the end-to-end critical path by prioritizing latency-sensitive continuations and adaptively retaining KV cache state only when warm resumption yields a latency benefit. Our evaluations show that MARS reduces end-to-end latency by up to 5.94x while maintaining nearly maximal system throughput. We further integrate MARS as the serving backend for the OpenHands coding agent framework, demonstrating its real-world effectiveness by accelerating end-to-end task completion time by up to 1.87x. Our source code is publicly available at https://github.com/Afterglow231/MARS_preview .
comment: 14 pages, 13 figures. Preprint
Shachi: A Modular, Controllable Framework for LLM-Based Agent-Based Modeling of Emergent Collective Behavior
How collective behaviors emerge from the interactions of individual LLM-driven agents is a central question in artificial life, yet controlled study of these emergent dynamics has been hindered by the lack of a principled simulation framework for systematic experimentation. To address this, we introduce Shachi, a principled methodology and modular framework that decomposes an agent's cognition into core components: Configuration for intrinsic identity, Memory for contextual continuity, and Tools for extended capabilities, all orchestrated by an LLM reasoning engine. This decomposition treats each cognitive component as an independently controllable variable, enabling perturbation studies that trace how micro-level cognitive traits propagate into population-level dynamics. We investigate behavioral patterns across a 10-task benchmark spanning three levels of collective complexity. Shachi enables memory transfer across environment transitions, producing history-dependent behavioral shifts, and allows agents to simultaneously inhabit multiple environments, revealing cross-environment interference invisible in single-environment studies. Furthermore, in a real-world U.S. tariff shock case study, locally interacting agents with individually controlled cognitive components produce macro-level market dynamics directionally consistent with observed real-world outcomes. Our work provides a rigorous, open-source simulation framework for LLM-based ABM, aimed at fostering cumulative scientific inquiry into the emergent collective behaviors of interacting artificial agents.
comment: Accepted to ALIFE 2026
Moving Out: Physically-grounded Human-AI Collaboration ICML 2026
The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. However, most existing collaboration benchmarks are discrete or do not consider physical attributes and constraints. To address this, we introduce Moving Out, a human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and coordinating actions to move an item around a corner. Moving Out consists of two challenges and human-human interaction data to comprehensively evaluate models' abilities to adapt to diverse human behaviors and unseen physical attributes. To give embodied agents the capability to collaborate with humans under physical attributes and constraints, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. We systematically compare BASS and state-of-the-art models in AI-AI and human-AI experiments, showing that BASS can effectively collaborate with both unseen AI and humans. The project page is available at https://live-robotics-uva.github.io/movingout_ai/.
comment: Accepted at ICML 2026
Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models
Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. But, obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment, that humans exhibit diverse coordination and communication behavior in this domain. We then present a series of experiments showing that our approach captures behaviors that are difficult to observe without large-scale data collection, and a follow-up user study to show that these generated behaviors are human-like. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.
Systems and Control (EESS)
Learning practically stabilizing output-feedback nonlinear controllers
This paper addresses the problem of learning an output-feedback surrogate controller offline that approximates a given, possibly computationally expensive, nonlinear controller-observer pair. The surrogate is modeled as a recurrent dynamical system and is trained to imitate closed-loop input/output trajectories generated by the given controller. Beyond imitation accuracy, the offline training problem promotes input-to-state practical stability by incorporating estimated state trajectories to learn a candidate Lyapunov function. The approach is validated on a nonlinear continuous stirred tank reactor, where constraint satisfaction and practical stability are assessed through a probabilistic validation approach. The numerical results highlight the benefit of jointly learning the Lyapunov function by comparing against an imitation-only baseline.
comment: 8 pages, 5 figures
Enhancing Secret Key Generation for UAV Communications via Codeword Reconstruction
With the rapid advancement of unmanned aerial vehicle (UAV), ensuring the security of communication links among UAVs has become crucial. In this paper, we propose a novel physical layer key generation scheme based on channel codeword reconstruction. In UAV communications, the high mobility of aerial nodes leads to short channel coherence time, which together with noise causes inevitable channel estimation errors. These errors significantly degrades the performance of wireless channel-based key generation. Therefore, we propose a codeword construction algorithm that achieves a polarization characteristic, which effectively segregates reliable keys from unreliable ones. Compared to the existing quantization-based key generation scheme, our approach maximize the utilization of raw channel information and employ soft-decision decoding to generate key. Simulation results demonstrate that the proposed scheme reduces the key disagreement rate for legitimate users and increases the number of consistently generated keys. Furthermore, our method ensures a lower key consistency rate for eavesdropper, which guarantees system security.
comment: Accepted and presented at an IEEE Wireless Communications and Networking Conference (WCNC) 2026 Workshop. 6 pages, 7 figures
Closed-loop Optimal Fault Detection for Uncertain Systems
Faults compromise the reliability and safety of complex engineering systems. The aim of this article is to address the problem of robust fault detection filter design for continuous-time linear time-invariant uncertain systems in open-loop or closed-loop configurations. The developed method offers a unified approach to handle parametric and dynamic uncertainties by solving a single Riccati equation, based on a worst-case disturbance and uncertainty model. This worst-case model is obtained by nonlinear optimization and application of the boundary Nevanlinna-Pick method. The efficacy of the proposed approach is demonstrated using an uncertain model of an experimental reticle stage used in the lithography industry. The results illustrate that an optimal compromise is achieved between sensitivity to faults and rejection of modelling uncertainties and disturbances on the other hand. This capability enables the clear differentiation between faults and undesired effects in residuals, thereby enhancing fault detection reliability, ultimately contributing to improved safety and performance of machines.
Exponential Weighting Model Predictive Control with Observer for Modular Multilevel Converters
In this article, we propose a model predictive control (MPC) scheme with an exponential cost function, along with an observer for the Modular Multilevel Converter (MMC), to enhance converter dynamic performance. In particular, as the prediction horizon $(N_P)$ increases, the numerical conditioning deteriorates rapidly, especially when a large $N_P$ is employed. This research work uses an appropriate cost function weighted to overcome the limitations of a large $N_P$. We further analyse the effects of constraints, observing that the designed MPC strictly adheres to them and that the control variable influences the MMC plant's response. The presence of the observer improves the prediction of the output, particularly for setpoint changes in the reference signal. We also analyze the prescribed performance, which provides a priori guarantees of closed-loop stability for the proposed controller.
comment: 6 pages
TNODEV: Toolbox for Neural ODE Verification
Neural ordinary differential equations (neural ODE) have started to appear in safety critical settings such as continuous-time controllers for cyber-physical systems and classifiers integrated into automated decision pipelines, raising the question of whether their behavior can be formally verified. Existing tools dedicated to neural ODE provide only a single reachability call without iterative input set refinement, limiting the precision of their verdicts to whatever one reachability call can deliver. We present TNODEV, the first sound formal verifier for neural ODE that integrates a falsification checker, a fast interval-based reachability backend based on continuous-time mixed monotonicity, a verification and refinement loop with three input-set splitting heuristics, and a parallel scheduler in a single end-to-end pipeline. TNODEV supports safe-set inclusion verification on pure neural ODE, neural ODE in closed loop with a neural network controller and general neural ODE (GNODE), with the safe set specified either as an interval or as the half-space intersection induced by a target classification label. We evaluate TNODEV on a range of benchmarks across safe-set inclusion and classification-robustness properties, including a direct reachability comparison against NNV~2.0 and CORA and a verification comparison against NNV2.0 on MNIST general neural ODE classifiers.
comment: 29 pages, 7 figures, Under review in TMLR
ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning SC
Roundabouts challenge automated driving in mixed traffic, as heterogeneous and non-deterministic human behavior, unknown driving intentions, and high interaction complexity create uncertainty about whether the conflict zone will be blocked or available at the moment of entry. We present ROSA-RL -- uncertainty-aware Roundabout Optimized Speed Advisory with Reinforcement Learning. It enables safe and efficient roundabout entry for automated and human-driven vehicles in mixed traffic through probabilistic conflict forecasting. A Transformer-based model predicts conflict zone occupancy over a five-second horizon, capturing multi-agent interactions to anticipate upcoming conflicts and available gaps. The prediction outputs encode uncertainty in future motion and intent, and augment the state of a classical RL framework, enabling uncertainty-aware speed coordination. Evaluated in simulations grounded in real-world data, ROSA-RL can effectively handle uncertainty and outperform a comparable model-based baseline, closing the gap to an ideal setting assuming fully known occupancy while improving traffic efficiency and safety. The source code of this work is available under: github.com/urbanAIthi/ROSA-RL.
comment: 8 pages, 2 figures, 2 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version
On the Lyapunov equation with the state matrix in companion form
We study the continuous-time Lyapunov equation under the assumption that the state matrix is a Hurwitz companion matrix. The standard Lyapunov theory implies that the unique solution $X$ is positive semidefinite. Motivated by positive systems, we investigate the question of whether $X$ is entrywise nonnegative. We prove that this is the case when the companion matrix has only real eigenvalues. The proof reduces each entry of $X$ to a quadratic form associated with a class of Cauchy-like matrices whose entries are expressed in terms of elementary symmetric polynomials. The required nonnegativity then follows from the positive semidefiniteness of these Cauchy-like matrices. We also discuss a stronger total-positivity property: total nonnegativity does not hold in general, but it is recovered under an additional sign condition on the expansion of the forcing vector in the eigenbasis of $A^\top$.
comment: 13 pages, no figures
Can Optimal Dispatch Models Recreate Reality? A Retrospective Analysis of Europe's 2022 Energy Crisis Using PyPSA-Eur
Electricity prices result from the complex interplay of supply and demand, which depends on variable renewable energy production, fuel costs, CO$_2$ price, and grid bottlenecks.Between 2020 and 2024, COVID-19 shifted demand and disrupted supply chains and operations in Europe, while Russia's invasion of Ukraine constrained gas supply, causing exceptional volatility in gas prices. It remains unclear whether optimal dispatch models can reliably replicate historical hourly prices during crises, given the rapid fluctuations in fuel prices and operators' limited foresight. In this work, we ask whether an optimal dispatch model, parametrised with historical data on demand, fuel, and CO$_2$ prices, can reproduce the observed market outcomes during this period. We perform hourly hindcasts of electricity generation in 35 countries from 2020 to 2024 using PyPSA-Eur and compare the nodal marginal electricity prices with historical ENTSO-E market prices using the Symmetric Mean Absolute Percentage Error (SMAPE). The scenarios compare static vs dynamic assumptions on fuel and CO$_2$ prices, as well as perfect foresight vs a two-week rolling-horizon optimization. Combining high-resolution fuel and CO$_2$ price time series with limited-foresight substantially improves hindcast accuracy, yielding an average SMAPE of 20.76% based on the daily load-weighted average price for the entire Europe. While improvements relative to the scenario with perfect foresight and static price inputs occur, they are most pronounced during periods of high fuel-price volatility, when marginal-cost swings propagate to electricity prices. In 2022, the optimal generation mix in most countries shows substantially less natural gas and more coal than historically observed. Other discrepancies can be attributed to the model's omission of real-world policy interventions, other dispatch constraints, and generator outages.
comment: 5 pages, 3 figures, Conference Paper EEM26
Robust Koopman MPC with Sets Updates for Time Delayed Systems
Koopman operators have shown significant potential in designing linear model predictive control (MPC) schemes for nonlinear systems on a lifted observable space. Recent advances have tackled the robust Koopman MPC design issue in the presence of modeling errors, relying on the prior estimation of the modeling uncertainty set. However, deriving a robust positively invariant set using a precalculated uncertainty set can be conservative because the uncertainty set bound is time-varying and dependent on the state and control. Additionally, no existing Koopman MPC design has addressed the closed-loop robustness challenge for nonlinear time delayed systems. Thereby, this article presents a robust adaptive Koopman MPC approach with online updates of uncertainty sets for a class of nonlinear time delayed systems. The unknown nonlinear time delayed system is first modeled in a data-driven manner to derive a lifted time delayed Koopman model in the feature space. By analyzing fundamental properties such as controllability and observability, a robust tube-based MPC algorithm is designed for the time delayed Koopman model. The robust adaptive Koopman MPC algorithm with online updates of the uncertainty sets is then presented to reduce conservatism. Closed-loop robustness under exogenous disturbances and asymptotic convergence in the nominal scenario are proven. Finally, numerical examples verify the effectiveness of the proposed approach.
comment: This paper is submitted to Automatica
HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization
Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.
GreenBox: Prototyping of an Automatic Road Accident Detection System with Real-Time Notification SMS
The Internet of Things (IoT) project, called "GreenBox", proposes the development of a prototype for the detection of road accidents, using sensors and actuators connected to an Arduino Uno and a Global System for Mobile Communications (GSM) card with 4G support, SIM7600G-H (global version) from DF-Robot. This system sends Short Message Service (SMS) to pre-established contacts, alerting, for example, family or friends, so that they immediately contact emergency entities in the event of an accident and provide them with all the necessary information, such as the location of the vehicle. The sensors include four push-buttons, with a resistance of 10kΩ, in order to define their default logical state, representing impact or impact sensors for each side of the vehicle; two water level sensors for the engine compartment and trunk; and a gyroscope/accelerometer to detectrollover from a 70 degrees inclination. The prototype also has a relay that is activated to turn off the engine in the event of a detected accident, preventing further damage. The GSM board has a Global Positioning System (GPS) antenna attached, allowing it to locate the vehicle and determine its speed, moments before the possible accident. The system also allows you to turn off the vehicle via SMS in case of theft. The entire project is prepared for the owner, for example the driver of the vehicle, to cancel all automation, via SMS, in case of false accident detection. Rollover detection is calculated using the arctangent of the accelerometer values and instructions for sending notifications are carried out by AT (Attention Commands) commands, between the microcontroller and the SIM7600 shield.
Transient-Safe Platooning via Dynamic Headway
Managing autonomous vehicle platoons requires a delicate balance between string stability and rigorous safety. This challenge is intensified by aggressive transients, such as highway merging. Although Constant Time Headway (CTH) spacing is the industry standard for Cooperative Adaptive Cruise Control, it lacks formal safety guarantees during significant velocity deviations. This letter proposes a computationally efficient control framework that considers a linear time-invariant model for the dynamics of each vehicle, while ensuring formal transient safety and stability. By introducing a spacing policy that naturally converges to CTH at steady state, we establish platoon safety as an inductive property. We derive a non-linear and saturated control law for the lead follower and provide sufficient initial conditions to guarantee velocity non-negativity and safety throughout the platoon for any CTH-based followers' control law. Numerical examples indicate the proposed methodology may be applicable even under non-nominal setups.
Output-Feedback Boundary Control of Reaction--Diffusion PDEs on Arbitrary Lipschitz Domains: A Target-Domain Approach
We present a domain-extension framework for output-feedback boundary stabilization of reaction-diffusion equations on arbitrary bounded Lipschitz domains, including non-convex and multiply connected geometries. The plant is posed on an irregular domain whose boundary has actuated and uncontrolled portions. Just as backstepping transforms the plant dynamics into a stable target system, the method embeds the plant in a target domain, such as a ball or a rectangle, where a stabilizing design is already known. Every boundary portion through which the extension proceeds must carry actuation and the complementary collocated measurement. Uncontrolled portions are allowed when they are shared with the target boundary and have the same boundary-condition type. The gap between the two domains is filled with a virtual copy of the plant dynamics, coupled to the plant through interface conditions, and the concatenated state evolves exactly as the known closed loop on the target domain. Well-posedness and exponential stability of the physical state follow by restriction. The offline design data are inherited from the target design and are closed-form for constant-coefficient plants on balls and rectangles. Online simulation of the virtual PDE has the same computational character as a full-order PDE observer, a standard component of output-feedback designs. A new explicit Neumann-actuated backstepping law on n-balls enlarges the available target designs. Output feedback is obtained by lifting the target-domain observer, driven by a collocated interface measurement relayed through the virtual domain. Numerical experiments on star-shaped, horseshoe, and multiply connected domains, with a partitioned plant/controller implementation and a shared-wall cavity, test the designs.
comment: Preprint submitted to IEEE Transactions on Automatic Control
Sustainable Heating with Karma: A Simulation Study of the KTH Live-In Lab
Space heating in buildings accounts for 10% of the global CO2 footprint. The widespread adoption of energy-efficient heating technology, e.g., heat pumps, could help reduce this figure, but technology alone may not suffice to reach carbon neutrality. Additionally, human occupants have an important role to play by adopting sustainable heating behaviors, e.g., avoid excessive window opening in the winter or (pre-)heat their units while clean energy is abundant. Thus far demand response policies aimed at promoting these behaviors have been monetary, which discriminates against low-income households and exposes human occupants who do not actively engage with real-time control signals to financial risks. This paper instead investigates the suitability of a non-monetary karma economy for promoting sustainable heating behaviors. Karma leverages the repeated and dynamic nature of heating energy allocations to attain climate targets both fairly and efficiently over time without resorting to financial means. As a first step towards experimentally validating the karma concept with real human occupants in the KTH Live-In Lab, we perform a simulation study on a digital model of the Live-In Lab. The study provides initial estimates of expected effects to guide the design of human-in-the-loop experiments, as well as assists with designing and tuning the karma economy in this context. As a specific example, we investigate how incorporating consumption memory in the form of karma affects window opening behaviors in comparison to conventional memory-less heating operation.
comment: 6 pages, 5 figures,
Extended Kalman Filter-Based State Estimation for a Nine-Compartment Nonlinear Epidemic Model -- Convergence Analysis and In-Silico Benchmark Calibrated on the COVID-19 Third Wave in Italy
This paper addresses real-time state estimation for a nine-compartment nonlinear COVID-19 epidemic model with two co-circulating strains, a super-spreader subpopulation, vaccination with waning immunity, hospitalization, and mortality. Time-varying transmission and vaccination rates are known inputs from a companion calibration, leaving the reconstruction of all nine states from three routinely reported observables: hospitalizations H, fatalities F, and vaccinated stock V. The contributions are theoretical rather than in the filter recursion. First, a Lie-derivative observability analysis yields, via a six-step derivation, the closed-form determinant |det(O9)| = delta_w * gamma_a^2 * kappa * rho2 * w1^2 * (delta_i - delta_p)^2 * |r1 - r2|, showing the level-2 codistribution is rank-deficient at the calibrated symmetric parameters (delta_i = delta_p, r1 = r2); the third Lie derivative restores full rank 9, with r2 the symmetry-breaking parameter. Second, an EKF is designed on the Euler-discretized dynamics with a closed-form 9x9 Jacobian and Joseph covariance update. Third, local exponential mean-square boundedness of the error is proved as a full theorem via the Reif-Gunther-Yaz-Unbehauen hypotheses, exploiting the bilinear drift and linear output to obtain a global-radius quadratic remainder bound that extends to bilinear-drift, linear-output systems. Fourth, the noise covariances are designed from calibration residuals and assessed by NEES and innovation-whiteness tests. All experiments use synthetic measurements from the calibrated model, so reported RMSE values (0.07%-2.72%) are methodology benchmarks, not predictive accuracy. A parameter-mismatch study shows measured and directly-coupled channels stay accurate under model error up to +/-30% while indirectly observed states degrade gracefully. The framework provides the state-feedback basis for future Model Predictive Control.
comment: Companion paper to arXiv:2606.07413. Submitted to ISA Transactions. 9 figures
An Adjoint-based Neural Regulator for Real-Time Optimal Control with State Constraints
This paper introduces a learning-based control framework for real-time constrained optimal control of nonlinear systems with safety guarantees based on the Pontryagin's Minimum Principle. The approach learns a neural co-state (adjoint) policy that encodes optimality through the system Hamiltonian, rather than directly approximating a control law. Feasibility is enforced separately at runtime through an efficient convex projection that incorporates actuator limits and safety constraints expressed as control barrier functions. We refer to this framework as an adjoint-based neural regulator (ANR) as it yields a controller that satisfies constraints while retaining the optimality structure encoded by the learned adjoint. We demonstrate the effectiveness of the proposed framework on nonlinear constrained control tasks using a unicycle model. The ANR achieves performance at par with nonlinear model predictive control at more than two orders of magnitude lower computational cost, while exhibiting near-invariant performance across unseen scenarios, thus, significantly outperforming reinforcement learning methods in out-of-training-distribution regimes.
Geometry-Driven Islanding Detection and Fault Classification for Grid-Forming Inverters: A Normally Hyperbolic Invariant Manifold Framework with Physics-Derived Thresholds
This paper presents a geometry-driven detection and fault-classification framework for grid-forming (GFM) inverters based on normally hyperbolic invariant manifolds (NAIM) and stochastic hypothesis testing. The GFM droop manifold $\mathcal{M}_0$ is identified as a NAIM of the closed-loop dynamics. Transverse fluctuations under grid noise are modeled as an Ornstein--Uhlenbeck process, and the long-run covariance is obtained from the algebraic Lyapunov equation. The detection statistic $D_t=T_w\barξ_{\perp}^{\top}Σ_{\mathrm{long}}^{-1}\barξ_{\perp}$ converges to $χ^2(2)$ under the null hypothesis, yielding the tuning-free threshold $D_α=-2\lnα$ and an asymptotically exact false-alarm rate $α$. A factor-of-2 error in earlier formulations is corrected and validated using 8,000 Monte Carlo realizations over nine window lengths and three significance levels. The Berry--Esseen bound $d_{\mathrm{KS}}\leq1.6704/(βT_w)$ is confirmed empirically. The minimum window condition $T_w\geq10/β_{\min}\approx1.0$ s, where $β_{\min}=\min(ω_f,ω_v)$, satisfies the IEEE 1547-2018 two-second detection requirement. A co-design theorem shows that increasing $(ω_f,ω_v)$ simultaneously enlarges the Fenichel spectral gap, tightens the null covariance, and reduces the false-alarm rate. Modal decomposition separates frequency and voltage contributions, enabling classification of islanding and voltage faults without additional sensors. Case studies confirm correct acceptance of normal operation, rapid detection of soft islanding, and accurate identification of a 10\% voltage sag.
Data-driven Control with Real-time Uncertainty Compensation for Multi-Fuel Engines
Multi-fuel compression ignition (CI) engines offer superior power density and fuel flexibility. However, achieving consistent and optimal combustion phasing across a wide range of operating conditions remains a major challenge, particularly in the presence of modeling uncertainties. This paper presents a novel, data-driven real-time uncertainty compensation framework for combustion control in multi-fuel CI engines. The proposed approach introduces a pseudo-engine speed that enables dynamic adaptation of control inputs in response to uncertainty affecting the engine. To model the underlying combustion process, a Gaussian Process Regression (GPR) model is first trained on available input-output data, capturing the nonlinear and fuel-dependent behavior across varying operating conditions. Control inputs are then synthesized through model inversion of the learned GPR surrogate and augmented with an uncertainty compensator designed to mitigate deviations caused by dynamic variations in operating conditions and model inaccuracies. This integrated control strategy allows for real-time input corrections within a finite number of combustion cycles. Theoretical analysis establishes finite-time convergence guarantees for the proposed controller. Simulation results demonstrate that the proposed method steers the combustion phasing to the desired value in real-time, providing a scalable and adaptive control solution for multi-fuel CI engine operation.
Geospatial sensitivity of transmission-constrained ACOPF to generator retirement
The US faces a growing resource adequacy challenge: new loads are being added at unprecedented scale while aging generating assets are being retired. In transmission-constrained grids, it is difficult to determine which units can be safely retired and which cannot be retired and instead require lifetime extensions until new generation can be built. Historically, this analysis was prohibitively time consuming. Transmission-constrained AC optimal power flow (ACOPF) is computationally intensive, and a thorough comparison and prioritization of generators could require hundreds or thousands of scenarios. We present an HPC-enabled framework that enables computation and geospatial mapping of the effects of generator retirement in terms of voltage magnitude and angle effects in the steady state. Specifically, our framework detects the effects of generator retirement using a simple k-nearest-neighbors model and a voltage-class-adjusted neighbor model. We demonstrate the results on over 8,000 generator retirement scenarios for a 70,000-bus transmission-constrained synthetic grid.
Distributed Safe Consensus Under Asymmetric Input and Time-Varying Output Constraints
This paper studies safe distributed consensus for single-integrator multi-agent systems over connected undirected graphs under simultaneous asymmetric actuator constraints and output safety constraints. Each agent is equipped with a continuously differentiable asymmetric actuator dynamics that maps a commanded control signal to the realized plant input while keeping the latter strictly inside a prescribed admissible interval. To address output safety, a barrier-coordinate transformation is introduced over a common time-varying safe interval, and a distributed synchronization law is designed in the transformed coordinates. The resulting controller integrates a graph-based coordination layer with an actuator-side tracking layer, thereby enabling simultaneous enforcement of input admissibility, forward invariance of the safe output set, and asymptotic synchronization. For compact admissible sets of initial conditions, it is shown that the closed-loop solution is complete, all signals remain bounded, the actuator inputs remain strictly within their asymmetric bounds, and the agent outputs remain inside the prescribed safe interval for all time. Moreover, the transformed synchronization errors converge exponentially to zero, and the original agent outputs asymptotically synchronize to a designer-selected admissible trajectory embedded in the common safe interval. Numerical simulations validate the proposed framework and demonstrate safe consensus under both asymmetric actuation bounds and time-varying output constraints.
Optimal Bounded Thrust Powered Descent with Analytical Ground-Collision Avoidance
The paper proposes a new approach to address the bounded-thrust powered-descent problem while ensuring ground-collision avoidance. A time-dependent polynomial approximation of the mass is employed to formulate a bounded linear-quadratic optimal-control problem that minimizes the thrust-acceleration control effort, terminal miss, and terminal velocity error. The resulting approximation is used to impose a hard constraint on the horizontal thrust profile while keeping the vertical thrust profile unconstrained. The key idea is a hierarchical separation of the thrust allocation, which enables analytical ground-collision avoidance under bounded thrust. Unlike existing bounded-thrust powered-descent approaches based on numerical optimization and trajectory-shaping constraints, the proposed method provides explicit analytical collision-avoidance conditions. Building on this formulation, the guidance law predicts the switching times between saturated and unsaturated arcs and shapes the thrust-acceleration profile to achieve a soft landing, even when the controller remains saturated over extended portions of the trajectory. Owing to its analytical nature, the guidance law is computationally efficient, and its continuous thrust profile facilitates real-time implementation. The proposed method was evaluated over a grid of perturbed initial conditions in realistic simulations, demonstrating accurate collision-free soft-landing performance. The results highlight the importance of combining saturation-aware guidance with ground-collision avoidance under bounded thrust.
comment: This work has been submitted for journal publication. 32 pages and 15 figures
Data-Driven Personalization of Automated Insulin Delivery
Automated insulin delivery (AID) systems are often tuned for the population and offer limited online adaptation to the inter- and intrapatient variability in insulin needs caused by meal patterns, physical activity, and fluctuations in insulin sensitivity. We present a real-time, data-driven personalization approach that adapts controller parameters using the subject's daily glycemic data. The adaptation is formulated as projected gradient descent on a daily risk metric, where the gradient estimation is designed to attenuate noise and metabolic variability. We use contraction theory to validate the optimization framework and convergence of the closed-loop system under adaptation. In silico experiments on the 100-adult cohort of the FDA-accepted UVA/Padova T1D simulator show that our method improves glycemic risk and increases time-in-range (TIR, 70-180\,mg/dL) by 2%, 3%, and 4% after 4, 8, and 17 weeks, respectively, under variability in meal timing, meal size, and insulin sensitivity.
Sandbox-Enabled Digital Twin for Cyber-Physical Systems
Cyber-physical system (CPS) controllers are vulnerable to faults and malicious attacks, including failures triggered only under complex plant conditions, yet pre-deployment validation typically relies on plant models or digital twins that exercise the controller's I/O as a black box. Side-channels, used to detect those run-time behavioral anomalies, are complementary but also open-loop, detached from I/O instrumentation, and driven by synthetic inputs rather than realistic plant feedback. We present a closed-loop digital twin framework that bridges this gap by hosting an unmodified controller binary within the SaMOSA Linux sandbox with its I/O rerouted to an external plant simulator, allowing coupled capture of simulated plant conditions and events alongside the controller's behavioral side-channels. The framework captures four time-synchronized controller side-channels (hardware performance counters, system calls, disk activity, and network activity) alongside plant state, and uses orchestration hooks for repeatable, parameterized runs. We demonstrate the framework on an OpenPLC runtime executing a Structured Text control program against a Modbus-connected IEEE 14-bus power-system model, and also discuss the application to robotics systems. The captured side-channels correlate controller behavior with simulated plant events, establishing an observability foundation for online testing, coverage analysis, and anomaly detection in CPS controllers.
comment: 5 Pages, 4 Figures
Task-Error Residual Learning for Real-Robot Five-Ball Juggling
For residual learning that refines existing behavior, sample efficiency depends on two things: how much information each rollout returns, and how efficiently the learner uses that information. Reinforcement learning's standard scalar reward carries far less information than the directional task error that defines the task. Random exploration further discards whatever information each rollout returns. Through residual learning with directional task-error supervision and a task error model that drives sample selection, we achieve stable three-, four-, and five-ball juggling on anthropomorphic Barrett WAM arms. Despite planning and controlling through a simple, idealized stack, the system converges from the second attempt. The first attempt drops, after which task error decreases monotonically without further failures. In comparison, five-ball juggling typically takes humans years of practice. We compare residual learners across two ternary axes, the directional information in the learning feedback and the commitment of the analytic prior, spanning Newton-style Jacobian updates, Composite Bayesian Optimization, and stochastic search methods. Both axes prove necessary: neither directional feedback nor an informative prior suffices alone, and the simplest method that combines them, a fixed-Jacobian Newton update, is the most reliable. The learned residual tolerates substantial prior misalignment and degraded joint tracking, affecting mainly convergence speed. The bottleneck for residual learning on real robots is therefore the information content of the supervision signal and how the learner uses it, not the accuracy of the surrounding stack. Video documentation of all experiments is available at https://kai-ploeger.com/residual-juggling.
comment: Submitted to the 2026 International Symposium on Robotics Research (ISRR)
When Should a Robot Replan? Regret-Guided Update Scheduling in Time-Varying MDPs
Robots operating in non-stationary environments must continually adapt their policies as the dynamics drift, but onboard energy and compute budgets cap how often a full state estimation and re-planning step can be performed. This raises a question: \emph{when}, along a horizon, should a robot spend its limited budget? We formulate this problem in time-varying Markov decision processes (TVMDPs) with a known bound on the rate of transition drift. We model execution as a \emph{skip-update} scheme in which, at chosen update times, the agent estimates the transition kernel by maximum likelihood and computes a finite-horizon policy, and between updates reuses this policy under a propagated state estimate. We analyze the dynamic regret of this scheme and show how it grows during skip intervals in terms of the properties of the TVMDP and the skip lengths; the resulting bound answers the opening question via an online, regret-guided update rule that allocates the budget adaptively. We evaluate the rule in a simulated Mars-rover navigation task with time-varying slip dynamics and on a Crazyflie quadrotor in indoor obstacle fields. Adaptive allocation outperforms other budgeted baselines.
Classifying Transient Regimes in Dynamic Systems through Properties of Spatial Curves and Stochastic Processes: A Data-Driven Approach
This article proposes a novel methodology for the classification of transient and stationary regimes in dynamic systems. Several sensor-based solutions for regime classification in the literature require the setting of several parameters, or are not suitable for scenarios involving multivariate systems that may contain periodic signals. The proposed method introduces a spatial curve representation of the considered system based on its sample mathematical moments. Then, by connecting concepts of stability theory, geometrical properties of spatial curves and stationary stochastic processes, two regime classifiers are designed using the arc length and the curvatures of the proposed curve. Both classifiers are capable of describing and detecting transient regimes, considering behaviors such as: multivariate asymptotically, marginally stability, and cyclostationarity. Furthermore, a quantitative comparison in performance and computation resources of the proposed classifiers against existing classifiers in the literature illustrates that the proposed regime classifier based on the arc length outperforms other techniques in classifying transient regimes for simulated linear, non-linear, and discontinuous multivariate systems under the specified studied conditions.
Line Outage Impact Factor (LOIF): A New Sensitivity Factor for Enhanced Transmission Observability
Transmission failures can lead to cascading failures and system blackout affecting millions of customers if not handled in time, and choosing the best locations to monitor the condition of the transmission system is crucial for power system reliability. In this paper, we propose a new sensitivity factor, the line outage impact factor (LOIF), which is especially useful for power system monitoring and can reveal the impacts of a transmission outage on the power flow of other lines more effectively than existing sensitivity factors, such as the line outage distribution factors (LODF). In this study, we apply the LOIF in transmission line outage detection in three test systems and compare it with LODF using a number of observed transmission line (OTL) selection methods based on these two sensitivity factors. Then we apply a machine learning algorithm to detect the outages of other lines by monitoring the selected OTLs, and the detection accuracy is evaluated using the F1-score. The results show that, in general, with the same number of OTLs, detection using the OTLs selected using LOIF achieved higher F1-scores. The pattern was especially consistent in large-scale systems, showing its potential in real-world applications.
Robust Direct Data-Driven Hamiltonian for Safe Set Computation under Measurement Noise and Disturbances
Safe set computation is a fundamental challenge in safety-critical control systems, especially in direct data-driven settings where safety analysis is performed directly from noise-affected measurements, without explicit modeling. A recently proposed method, Data-Driven Hamiltonian (DDH), enables reachability analysis directly from measurements, without relying on prior knowledge of the underlying system dynamics. This paper extends the DDH framework to a robust setting that accounts for measurement noise, exogenous disturbances, and sampling-induced state-velocity estimation error. A Robust Data-Driven Hamiltonian (R-DDH) is derived from noisy measurements and shown to yield a certified lower bound on the exact Hamiltonian. This results in a provable under-approximation of the value function and an inner approximation of the associated safe set. The gap between the data-driven and exact Hamiltonians is quantified, and it is shown to converge to zero with more data in a noise-free setting with additive disturbances. The effectiveness of the approach is shown through two case studies: a constrained double integrator and an aircraft taxiing system with a nonlinear closed-loop controller operating under perceptual uncertainty.
Optimal Powered Descent Guidance with Pyramid-Shaped Approach-Angle Constraints
In this paper, a novel optimal soft-landing guidance law with inequality approach-angle path constraints is analytically derived. The proposed guidance law prevents ground collision and enables approach-angle control by constraining the optimal trajectory to remain within a convex inverted pyramid originating at the landing point. A 3D point-mass linear kinematic model in a constant gravitational field is employed, together with a quadratic control-effort cost and terminal constraints on position and velocity. Analytical open-loop and closed-loop solutions, together with the optimal final time, are derived using Pontryagin's Minimum Principle and the optimality conditions at the transitions between unconstrained and constrained arcs. It is additionally shown that the optimal final time decreases when the path constraints become active. The resulting guidance law is continuous, piecewise linear in time, and nonlinear in the states in closed-loop. When a constraint becomes active, the controller cancels the gravitational component normal to the constraint, causing the trajectory to evolve along the constraint surface. The proposed guidance law is evaluated in simulations under various initial conditions, demonstrating accurate landing performance and consistent satisfaction of the path constraints.
comment: This work has been submitted for journal publication. 36 pages 10 figures
Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains
In skill-constrained production-inventory systems, the qualified human capacity available tomorrow depends on training decisions made today: production requires certified workers, certifications decay unless maintained, and training consumes the same scarce worker hours that production needs now. We study a closed-loop skill-constrained model predictive controller that, at every shift, solves a finite-horizon mixed-integer program over production, inventory, backlog, and training, with binary predicted certification, hard production eligibility, and an interpretable terminal value that prices certified-capacity gaps at the horizon boundary; only the first-period action is applied before replanning. On synthetic, seed-controlled SkillChain-Gym scenarios - announced and surprise new-skill shocks, demand shocks, absenteeism, forecast- and availability-quality modes, capacity-boundary and training-rate sweeps, and negative controls - we evaluate the controller against production-only and maintenance-only ablations, static cross-training insurance plans, and a strong reactive heuristic, under an ex-ante locked configuration and paired statistics. The result is regime dependence, not superiority: no policy class dominates. Predictive control helps when skill or labor bottlenecks are forecastable early enough for training to complete; lean static insurance remains hard to beat under surprise shocks, near the demand-capacity boundary, and wherever pre-shock slack makes insurance cheap. Attribution ablations separate certification maintenance, re-acquisition of lapsed certifications, and greenfield skill acquisition. Forecastability, not adaptivity per se, decides when predictive control pays.
SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions
Production planning increasingly has to treat workforce capability as a decision variable: certifications lapse when skills are not maintained, new products require skills the current workforce does not hold, and reskilling competes for the same worker hours needed for production. Existing operations benchmarks usually treat labor as exogenous, while workforce-planning models with skills and learning are rarely released as reusable testbeds. We introduce SkillChain-Gym, a benchmark specification for reskilling-aware production-inventory control: a single-site environment with stylized worker skill-state dynamics, hard threshold certification, forgetting, and capacity-consuming training actions constrained by the same per-worker time budget as production. The benchmark includes seed-controlled disruption scenarios, three feasibility modes with projection diagnostics, deterministic replay, and metrics covering operations, resilience, capability growth, and training-access distribution. We evaluate production-only, reactive adaptive, water-filling adaptive, and static-insurance policies with budget variants over 60-shift horizons with paired statistical tests. The results are regime-dependent rather than a ranking. Training-capable policies dominate the production-only baseline, and maintenance training is necessary under forgetting even without disruptions. Among training-capable classes, adaptive training helps when bottlenecks are visible in the forecast, while a lean static cross-training plan, a deliberately favorable comparator whose structure encodes relevant skill contingencies, acts as strong insurance under surprise shocks and absenteeism. Capacity slack and the forgetting rate govern the boundary between these regimes. No policy class dominates across regimes, motivating forecast-driven controllers that decide when to buy skill insurance and when to react.
Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception
Continuous AI inference on resource-constrained edge hardware introduces deployment effects that are largely invisible to conventional benchmark evaluation, including temporal instability in streaming video, thermal throttling under sustained load, and workload-dependent performance variability. We present Edge-TSR, a deployment-oriented continuous edge inference system for sustained roadside perception on the NVIDIA Jetson Orin Nano. Edge-TSR integrates detection, tracking, fine-grained classification, and a lightweight track-aware temporal stabilization mechanism that improves streaming inference consistency with negligible computational overhead. Our central finding is that benchmark-centric evaluation systematically overstates deployed edge inference performance. Across three state-of-the-art baselines, we observe consistent 20-30% relative degradation when transitioning from static-image evaluation to real-world streaming deployment. Edge-TSR addresses this gap through temporal inference stabilization, recovering up to 10.16% classification accuracy over per-frame inference baselines while maintaining sustained real-time performance under continuous operation. We evaluate the complete system under diverse real-world deployment conditions, jointly characterizing inference quality, latency, throughput, and thermal behavior during long-duration operation. A 55-minute vehicular deployment over a 26 km route demonstrates sustained operation at 16.18 FPS within safe thermal limits on a single embedded device without cloud offload. Our findings show that deployment-aware evaluation and temporal inference stabilization are necessary components of continuously operating edge AI systems intended for real-world sensing deployments. We release a sample annotated streaming video evaluation dataset and full system implementation to support reproducible deployment-centric evaluation.
A Stateful Stochastic Allocation Mechanism with Fairness Guarantees for Networked Electricity Systems
This paper develops and analyses the Fair Play Automatic Market Maker (FP-AMM), a programmable electricity allocation mechanism in which scarcity allocation is treated as a controlled, stateful, and auditable cyber-physical process. Existing mechanisms such as locational marginal pricing are memoryless and cannot account for historical service outcomes, preventing guarantees of equitable treatment across market intervals. The FP-AMM employs a two-stage stochastic clearing rule comprising service-priority sampling and inverse-fairness weighting, coupled with a DC-OPF feasibility set and bounded shortage memory updated through a saturated integrator. Four main results are established. First, the shortage-memory state is invariant in $[0,1]^N$ and the update map is a contraction with rate $1-β$. Second, the intra-interval clearing operator converges linearly to a unique fixed point with contraction factor $q\in(0,1)$. Third, under the Fair Play priority rule, the per-node delivery ratio converges almost surely to the contracted target $F^\star$, with a finite-time $O(1/\sqrt{T})$ bound obtained via Lyapunov analysis of the deficit recursion. Fourth, event-triggered execution guarantees practical ultimate boundedness of the allocation tracking error and quantifies the computation-fidelity trade-off. The mechanism is validated on the IEEE 14-, 57-, and 118-bus systems over $T=5000$ market intervals. Fairness convergence to $F^\star$ is achieved on all benchmarks, peak weak-bus fairness error is reduced by 54% on the IEEE-57 network and by up to 55% relative to an equal-weight baseline during scarcity periods, and DC feasibility is maintained throughout.
On the Strong Duality in Continuous-time and Discrete-time Linear Quadratic Regulators
This paper revisits the strong duality in the linear quadratic regulator (LQR) for continuous-time and discrete-time systems, and explores its interconnection with typical assumptions and the uniqueness of primal-dual solutions. Using a linear operator $Ψ$, we formulate a common nonconvex LQR problem that captures both time domains. We then derive its Lagrange dual problem and establish the strong duality via a rank-constrained tight semidefinite program (SDP) relaxation. Further, we show that the primal-dual optimal solutions to the SDP relaxation, after dropping the rank constraint, recover the classical algebraic Riccati equations and optimal feedback gains in a constructive manner. The dual derivation and strong duality analysis rely on mild standard assumptions and exploit the properties of the linear operator and its adjoint, revealing a structural symmetry between the two time domains.
comment: 8 pages. Accepted for publication in the IEEE Control Systems Letters (L-CSS)
Saddle Point Evasion via Curvature-Regularized Gradient Dynamics
Nonconvex optimization underlies many modern machine learning and control tasks, where saddle points pose the dominant obstacle to reliable convergence in high-dimensional settings. Escaping these saddle points deterministically using continuous-time optimization remains an open challenge: gradient descent is blind to curvature, stochastic perturbation methods lack deterministic guarantees, and Newton-type approaches suffer from Hessian singularity. Adopting the perspective of viewing optimization algorithms as dynamical systems, we present Curvature-Regularized Gradient Dynamics (CRGD), which augments the objective with a smooth penalty on the negative Hessian eigenvalues, yielding an augmented cost that serves as an optimization Lyapunov function with user-selectable convergence rates to second-order stationary points. Numerical experiments confirm that CRGD converges to second-order stationary points, even in regimes where gradient descent fails.
comment: Accepted for publication in IEEE Control Systems Letters. 6 pages, 3 figures
An Information Theory of Finite Abstractions and their Fundamental Scalability Limits
Finite abstractions are discrete approximations of dynamical systems, such that the set of abstraction trajectories contains all system trajectories. There is a consensus that abstractions suffer from the curse of dimensionality: for the same ``accuracy" (how closely the abstraction represents the system), the abstraction size scales poorly with system dimensions. And yet, after decades of research on abstractions, there are no formal results on their accuracy-size tradeoff. In this work, we derive a statistical, quantitative theory of abstractions' accuracy-size tradeoff and uncover fundamental limits on their scalability, through rate-distortion theory -- the information theory of lossy compression. Abstractions are viewed as encoder-decoder pairs, encoding trajectories of dynamical systems. Rate measures abstraction size, while distortion describes accuracy, defined as the spatial average deviation between abstract trajectories and system ones. We obtain a fundamental lower bound on the minimum achievable abstraction distortion, given the system dynamics and the abstraction size; and vice-versa a lower bound on the minimum size, for given distortion. The bound depends on the complexity of the dynamics, through trajectory entropy. We demonstrate its tightness on some dynamical systems. Finally, we showcase how this new theory enables constructing minimal abstractions, optimizing the size-accuracy tradeoff, through an example on a chaotic system.
Same-Origin Policy for Agentic Browsers
Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that prevents unauthorized automated cross-origin data flows induced by scripts. However, whether SOP remains effective in agentic browsers is an open question that has not been systematically studied. In this work, we bridge this gap. We first observe that an agentic browser can itself serve as an automated channel for cross-origin data flows, potentially leading to SOP violations. To investigate this phenomenon, we construct SOPBench, a benchmark for evaluating SOP violations in agentic browsers. Our evaluation shows that existing agentic browsers frequently violate SOP, both in benign settings and under attacks. To address this problem, we propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers. We implement SOPGuard in BrowserOS, an open-source agentic browser. Extensive evaluations demonstrate that SOPGuard effectively enforces SOP while preserving utility and incurring only a small runtime overhead. Our code and data are available at https://github.com/wxl-lxw/BrowserOS-SOPGuard.
ROSA: Roundabout Optimized Speed Advisory with Multi-Agent Trajectory Prediction in Multimodal Traffic SC
We present ROSA -- Roundabout Optimized Speed Advisory -- a system that combines multi-agent trajectory prediction with coordinated speed guidance for multimodal, mixed traffic at roundabouts. Using a Transformer-based model, ROSA jointly predicts the future trajectories of vehicles and Vulnerable Road Users (VRUs) at roundabouts. Trained for single-step prediction and deployed autoregressively, it generates deterministic outputs, enabling actionable speed advisories. Incorporating motion dynamics, the model achieves high accuracy (ADE: 1.29m, FDE: 2.99m at a five-second prediction horizon), surpassing prior work. Adding route intention further improves performance (ADE: 1.10m, FDE: 2.36m), demonstrating the value of connected vehicle data. Based on predicted conflicts with VRUs and circulating vehicles, ROSA provides real-time, proactive speed advisories for approaching and entering the roundabout. Despite prediction uncertainty, ROSA significantly improves vehicle efficiency and safety, with positive effects even on perceived safety from a VRU perspective. The source code of this work is available under: github.com/urbanAIthi/ROSA.
comment: 8 pages, 1 figure, 4 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2025 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version
The GIST 2217-Bus Test System: A Public-Data Synthetic Model of the Korean Power Grid
No model of the Korean transmission system at native resolution is publicly available, which makes reproducible research on one of the world's most distinctive grids difficult-an islanded interconnection with extreme separation between generation and the Seoul Metropolitan Area load center, low renewable penetration, and heavy reliance on extra-high-voltage (EHV) transmission. Working strictly from public data, and for research purposes only, we present the GIST 2217-bus test system, a geographically grounded synthetic model of the Korean grid. Unlike fully synthetic cases, whose lines match no real corridor, and aggregated public Korean models, it derives its 345 and 154 kV layout from the OpenStreetMap/OpenInfraMap power layer by a multi-source shortest-path reassembly of overhead-line geometry, gap-fills unreachable substations with a geographic minimum-spanning-tree backbone, and calibrates the aggregate circuit length to published national statistics (94/100/109% at 765/345/154 kV). The model spans 2217 buses, 512 generation and renewable sources (144 GW), 3708 AC line circuits plus four high-voltage direct-current (HVDC) converter links, 3324 transformers, and reactive resources (shunts and 11 FACTS devices), serialized to a PSS/E-compatible CSV schema. The model is distributed as a frozen operating point-taps, setpoints, and bus voltages settled once offline and baked into the data-so a single deterministic pandapower Newton-Raphson pass (with reactive limit enforcement and HVDC converter settling) reproduces an 85 GW high demand snapshot at a single connected operating point (mean transmission voltage 0.995 pu, 2.6 % losses), structurally consistent with the independent public KPG193 model. The dataset, maps, and tooling are released as a citable platform for power flow, planning, and decarbonization studies.
comment: 11 pages, 5 figures, 5 tables
Stannic: Systolic STochAstic ONliNe SchedulIng AcCelerator
Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to efficiently balance workloads due to scheduling overhead, lack of adaptability to stochastic workloads, and suboptimal resource utilization. The scheduling problem further compounds in the context of shared HPC clusters, where job arrivals and processing times are inherently stochastic. Prediction of these elements is possible, but it introduces additional overhead. To perform this complex scheduling, we developed two FPGA-assisted hardware accelerator microarchitectures, Hercules and Stannic. Hercules adopts a task-centric abstraction of stochastic scheduling, whereas Stannic inherits a schedule-centric abstraction. These hardware-assisted solutions leverage parallelism, pre-calculation, and spatial memory access to significantly accelerate scheduling. We accelerate a non-preemptive stochastic online scheduling algorithm to produce heterogeneity-aware schedules in near real time. With Hercules, we achieved a speedup of up to 1060x over a baseline C/C++ implementation, demonstrating the efficacy of a hardware-assisted acceleration for heterogeneity-aware stochastic scheduling. With Stannic, we further improved efficiency, achieving a 7.5x reduction in latency per computation iteration and a 14x increase in the target heterogeneous system size. Experimental results show that the resulting schedules demonstrate efficient machine utilization and low average job latency in stochastic contexts.
comment: 31 pages, 19 figures, Conference version published in Int'l Conference on Computer Aided Design (ICCAD) 2025. Journal version (current version, revision 2) accepted in ACM TRETS 2026
Distributed Omniscient Observers for Multi-Agent Systems: Design and Applications
This paper proposes distributed omniscient observers for both heterogeneous and homogeneous linear multi-agent systems, such that each agent can correctly estimate the states of all agents. The observer design is based on local input-output information available to each agent, and knowledge of the global communication graph among agents is not necessarily required. The proposed observers can contribute to distributed Nash equilibrium seeking in multi-player games and the emergence of self-organized social behaviors in artificial swarms. Simulation results demonstrate that artificial swarms can emulate animal social behaviors, including sheepdog herding and honeybee dance-based navigation.
Dual-Regularized Riccati Recursions for Interior-Point Optimal Control
We derive closed-form extensions of the sequential and parallel Riccati recursions for solving dual-regularized linear-quadratic regulator (LQR) problems, with $O(N)$ sequential time and $O(\log(N))$ parallel time, respectively. We show that these subproblems arise when using regularized primal-dual interior-point methods to solve smooth, constrained, non-convex, discrete-time optimal control problems via multiple-shooting, even in the presence of stagewise equality or inequality constraints, and without imposing any rank requirements on constraint Jacobians. We prove that, when certain inertia conditions on the Newton-KKT matrix are met, each nonzero primal step is a descent direction of an augmented barrier-Lagrangian merit function. We characterize these inertia conditions in terms of the positive-definiteness of the dual-regularized Riccati pivots (a weaker condition than the standard LQR positive-definiteness requirements), thereby yielding inexpensive certificates of the required inertia. We provide MIT-licensed implementations of our methods in C++ and in JAX, as well as a full formalization of our results in Lean. We benchmark our algorithm against leading optimal control and nonlinear programming solvers on complex trajectory optimization problems, establishing competitive performance on moderate problems and substantial gains as the horizon length, problem dimension, and constraint count increase.
Data-driven invariant set for nonlinear systems with application to command governors
This paper presents a novel approach to synthesize positive invariant sets for unmodeled nonlinear systems using direct data-driven techniques. The data-driven invariant sets are used to design a data-driven command governor that selects a command for the closed-loop system to enforce constraints. Using basis functions, we solve a semi-definite program to learn a sum-of-squares Lyapunov-like function whose unity level-set is a constraint admissible positive invariant set, which determines the constraint admissible states and input commands. Leveraging Lipschitz properties of the system, we prove that tightening the model-based design ensures robustness of the invariant set to the inherent plant uncertainty in a data-driven framework. To mitigate the curse-of-dimensionality, we repose the semi-definite program into a linear program. We validate our approach through two examples: First, we present an illustrative example where we can analytically compute the maximum positive invariant set and compare with the presented data-driven invariant set. Second, we present a practical autonomous driving scenario to demonstrate the utility of the presented method for nonlinear systems.
OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction
A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8-hour trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.
comment: Project website: https://omniretarget.github.io
Trapping Regions for Quadratic Systems with Generalized Lossless Nonlinearities
We consider a class of quadratic systems, primarily motivated by incompressible fluid flows, where the nonlinearities are generalized lossless: they do not produce or dissipate energy, as measured by a generalized quadratic metric. Our goal is to compute trapping regions, which are forward invariant sets that certify ultimate boundedness. The key contribution is a novel parameterization of the generalized lossless condition that enables optimization of trapping regions for a broader class of quadratic systems. We also formulate the conditions for ellipsoidal trapping regions, whereas spherical regions have been the focus of prior works. We provide three numerical examples, which demonstrate the improvements offered by the proposed approach relative to existing methods.
Robotics
Anisotropic Template Ansätze for Robust Positive Invariance under State-Dependent Uncertainty
We establish sufficient conditions for robust positive invariance under state- and input-dependent disturbances with anisotropic covariance structure. The proposed ansatz maps a fixed ellipsoidal template through a GP-derived positive-definite matrix field, subsuming scalar homothetic scaling while retaining finite graph-based verification. The resulting LMI conditions couple the learned field to Schur-stable dynamics; an isotropic fallback with inflation factor $r=1/(1-γ_{\mathrm{cl}})$ proves admissibility. During each learning epoch the field is frozen, so online tube evaluation is one GP covariance query and a small matrix square root, with no online set iteration or LMI solve. Quadrotor simulations show a $195\times$ reduction in 3D velocity-tube volume and a $2.1{\times}10^5$ reduction in the joint 7D velocity-control subspace relative to a non-adaptive homothetic baseline. This extended version adds full proofs, a separated offline/online complexity analysis, and controller-sweep, contraction, and projection-area studies.
A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation CEC
Reliable state estimation in robotics and control re quires balancing estimation accuracy against computational cost. While filtering-based methods such as the Extended Kalman Filter (EKF) provide efficient real-time updates, and optimisation based formulations using factor graphs improve global consistency, the role of optimisation scheduling is often treated implicitly rather than examined as an explicit design variable. This paper presents an experimental study that explicitly isolates optimisation scheduling using a Smart Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. By combining EKF-based state propagation with periodically invoked batch optimisation and holding solver structure and effort fixed, the main contribution of this work is the experimental characterisation of optimisation scheduling as an independent design variable governing the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly influences pre optimisation drift, transient error behaviour, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimisation can be retained at a fraction of the computational cost, highlighting optimisation scheduling as an under-explored yet critical consideration in hybrid state estimation systems.
comment: This work has been accepted for presentation/publication at the 2026 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). The final published version will appear in IEEE Xplore
Leveraging Deep Learning for Object and Position Recognition of Load Carriers for Autonomous Logistics Vehicles
This work explores the use of artificial intelligence in mobile robotics to achieve autonomous detection and pose estimation of load carriers for automated pickup. A deep neural network is designed to recognize predefined landmarks on the carrier from RGBD data; these landmarks are then used to compute the carrier's pose. The network operates directly on RGBD images to estimate landmark positions, which form the basis for determining the carrier's location. The approach is validated in extensive experiments and comprises both software and hardware implementations. A deep learning-based framework is presented to detect load carriers and estimate their pose for use with autonomous logistics vehicles. Our method uses a convolutional neural network to identify characteristic reference points on the carrier from RGBD input and computes its pose by combining these inferred landmarks with prior geometric knowledge. Experiments show that the resulting accuracy is sufficient for reliable load carrier detection in industrial environments, confirming the suitability of the method for autonomous intralogistics applications.
comment: 6 pages, 6 figures, IFAC World Congress2026, \c{opyright} 2026 the authors. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-ND
$λ$-Reachability: Geometric-Horizon Safety Bellman Equations for Humanoid Safety
We introduce $λ$-Reachability, a scalable approach to Hamilton--Jacobi safety analysis for high-dimensional robotic systems. Unlike prior discounted formulations that rely on fixed one-step Bellman updates, $λ$-Reachability employs a stochastic multi-step estimator of the safety value, using a geometrically distributed rollout horizon together with a randomly absorbed terminal. Conceptually analogous to TD($λ$), $λ$-Reachability interpolates between local self-consistency updates and long-horizon max-over-trajectory safety targets via an interpretable horizon-control parameter. Unlike TD($λ$), where the terminal value is always incorporated in learning targets, the terminal safety value in $λ$-Reachability is only used at a probability controlled by parameter $δ$. We formally show that for $δ<1$, the update induces a contraction mapping that allows temporal-difference learning; as $λ\to 1$, the estimator recovers the undiscounted reachability objective. We apply $λ$-Reachability to high-dimensional safety learning problems with both simulated and real humanoid robots under balance and collision avoidance constraints. Experimental results demonstrate that $λ$-Reachability significantly improves both safe-set boundary classification and safety margin estimation compared to single-step temporal-difference baselines.
Friction Characterization of a Cable-Driven Differential Actuation System for Lower-Limb Exoskeletons
Lower-limb exoskeletons require actuation systems that can provide accurate joint torque control while preserving low mass and encumbrance. Conventional architectures often rely on independently actuated joints and joint-level torque sensors, increasing system complexity and weight. This paper presents a novel differential actuation architecture for hip-knee flexion/extension, enabling cooperative torque sharing between two motors via a linear differential mapping between motor and joint. To compensate for transmission losses, a model-based friction estimation strategy is developed and experimentally implemented, allowing accurate joint torque estimation without the need for torque sensors. The proposed solution is validated on a physical prototype, demonstrating the feasibility of sensorless torque estimation in a differentially actuated hip-knee module of a lower-limb exoskeleton.
comment: Accepted for presentation IEEE RAS/EMBS 11th International Conference on Biomedical Robotics and Biomechatronics
ControlMap: Controllable High-Definition Map Generation for Traffic Scenario Simulation
Simulation is central to validating autonomous driving systems, yet current pipelines are limited by insufficient scenario diversity due to costly High Definition (HD) map creation. Scaling HD maps requires expensive data collection and manual processing. Moreover, existing generative models lack the fine-grained control necessary to target specific road topologies during generation. This paper presents a data-driven pipeline for controllable HD map generation using latent diffusion and ControlNet for spatial conditioning. To our knowledge, we are the first to inject spatial guidance signals into a diffusion model for HD map synthesis. Furthermore, our model supports adjustable conditioning strength through classifier-free guidance and city-level style transfer via city label conditioning. To complement existing metrics, we introduce two novel metrics to evaluate adherence to the control signal and similarity to ground-truth maps. Experiments demonstrate that our model generates realistic HD maps that faithfully follow input road topologies while accurately preserving city-specific details.
Energy-Efficient Arm Reaching for a Humanoid Robot via Deep Reinforcement Learning with Identified Power Models
Humanoid robots performing in-field manipulation tasks, such as robotic apple harvesting, face severe energy constraints that directly limit the number of reaching motions that can be executed per battery charge. This paper presents an end-to-end, energy-aware reinforcement learning framework for the 7-degree-of-freedom left arm of the Unitree~G1 humanoid robot, combining a physics-based, experimentally identified electrical power model with a Soft Actor-Critic (SAC) policy trained in a Pinocchio-based rigid-body dynamics simulator. The RL policy operates on an incremental joint-position action space and is trained with a Hybrid Constellation Reward that combines a four-point end-effector constellation distance with a torque-norm energy proxy; after % $5\times10^6$ training it reaches a $69.9\%$ success rate over $1\,000$ random targets in kinematic simulation, at a mean energy of \SI{98.16}{\joule} on successful episodes. Finally, on the physical Unitree~G1, the policy is validated over three independent 10-target batches, achieving a mean energy of $71.5 \pm 48.3$\,J, an end-effector position error of $2.64 \pm 1.04$\,cm, and an orientation error of $6.92 \pm 1.33^\circ$ -- within the \SI{4}{\centi\metre}/$8.6^\circ$ training tolerance. These results constitute a first step toward energy-aware reinforcement-learning-based arm reaching for humanoid robots.
Identification of a Physics-Based Electrical Power Consumption Model for the Unitree G1 Humanoid Arm
Accurate prediction of electrical power consumption is essential for energy-aware motion planning, battery management, and thermal monitoring in battery-powered humanoid robots. This letter presents a physics-based, linear-in-parameters model for the electrical power consumption of the seven-degree-of-freedom left arm of the Unitree~G1 humanoid robot. The proposed formulation combines actuator loss terms with a baseline-torque correction that captures changes in gravity-compensation load and enables accurate prediction of negative net power trajectories. Pairwise interaction terms are introduced to model power coupling during simultaneous multi-joint motion. Model parameters are identified from experimental data collected on a physical Unitree~G1 using onboard power measurements as the regression target. Across 897 trajectories covering single-joint and coordinated arm motions at multiple speed levels, the identified model achieves $R^2 = 0.933$ with an RMSE of 1.07 (W). Validation on 46 trajectories executed at previously unseen speeds yields $R^2 = 0.965$, demonstrating strong generalisation beyond the identification dataset. Analysis of the identified parameters reveals distinct power-consumption characteristics across the arm, with viscous friction dominating most joints (shoulder pitch and all three wrist joints), copper losses dominating shoulder yaw and the elbow, and shoulder roll uniquely dominated by Coulomb friction.
GeoTLM: Geometry-aware Tactile-Language Models for Contact Motion Orientation Reasoning of Dynamic Objects
Modern tactile-language models (TLMs) have shown potential for robot learning tasks, such as material and texture recognition. However, for contact-rich scenarios, these TLMs struggle to understand the physical properties of dynamic objects, such as rotation and sliding directions. For instance, our preliminary experiments reveal that popular TLMs, such as Sparsh and AnyTouch2, exhibit weak performance on basic rotation direction reasoning from GelSight Mini tactile data. This surprising gap inspires us to explore a novel research question: Can we inject physically grounded geometric priors into TLMs to enable reliable contact orientation reasoning of dynamic object properties? To this end, we propose GeoTLM, a novel geometric representation-guided TLM for the perception of dynamic contact events. Our key idea is to preserve and structure tactile shear-field geometry before language-level reasoning, rather than forcing low-resolution tactile tokens into fragile closed-form physics operators. To achieve this, we propose a lightweight (only 14k parameters) yet novel Differentiable Geometric Representation (DGR). Specifically, DGR learns a contact-mask-guided representation in the shear field and aggregates it through an antisymmetric seven-region pooling design, motivated by the physical intuition that rotational contact produces antisymmetric deformation patterns. We conduct experiments on two representative tasks: rotation direction and sliding direction reasoning. Extensive experiments show that GeoTLM improves novel-object rotation accuracy by +14.6% and real-sensor sliding accuracy by +16.2% over the same backbone without the geometric encoder. Overall, our work paves a new way for physically grounded tactile-language reasoning, with strong potential for dynamic object understanding and contact-rich robotic manipulation.
comment: 7 pages, 3 figures, 4 tables
VL2Spike: Spike-driven Distillation from VLMs for Low-Power Visual Perception in Embodied AI
Spiking neural networks (SNNs) are brain-inspired, event-driven models that compute with sparse spikes, which enables highly efficient visual perception in resource-constrained embodied AI models. The emergence of Spiking-Transformer models with spike self-attention has substantially improved the learning capacity of pure SNNs. Although SNNs are energy efficient, their performance is still limited by the spike-based architecture and optimization challenges, as standard gradient descent rules cannot be directly applied. Recently, vision-language models (VLMs) have shown rich multi-modal knowledge representation capabilities for visual perception. Thus, it is promising to leverage VLMs for better Spikformer training. To this end, we present VL2Spike, a novel spike-based knowledge distillation (KD) framework that bridges multi-modal knowledge from VLMs with compact Spikformer models. This design enhances the learning capacity of Spikformer models while preserving their energy-efficiency merits, thereby offering a practical pathway toward low-power robotic perception. Our VL2Spike brings two key technical contributions. To align with spiking dynamics, we first propose spatial-temporal visual spike (SVS) distillation, which achieves (1) shared manifold alignment between VLM image features and spike tokens, and (2) warm-started temporal consistency on membrane potentials and spike rates. We then design a novel spike prototype-guided linguistic (SPL) distillation strategy that aligns Spikformer's class prototypes and logits with promptable VLM text embeddings. Extensive experiments show that VL2Spike achieves 6.81% gain across three static datasets with only 15.7% energy consumption. It also exhibits strong generalization capacity on robotic visual place recognition (VPR) with a gain of 6.63%, highlighting its potential for low-power perception in embodied AI.
comment: 9 pages, 4 figures, 8 tables
LoComposition: Terrain-Adaptive Energy-Efficient Quadruped Locomotion without Gait Priors
Learning-based quadrupedal locomotion typically relies on complex reward formulations that entangle task specification, operational limits, gait preference, and terrain adaptation within a single optimization objective. We instead treat these functions through distinct mechanisms: rewards for task specification, constraints for operational limits, energy minimization for gait preference, and exteroceptive perception for adapting energy use to terrain difficulty. We show that these components jointly enable efficient, terrain-adaptive locomotion, and that removing each component exposes a distinct failure mode. Our formulation removes explicit gait priors (including air-time, contact-count, and foot-clearance targets) in favor of emergent behavior. Compared to a conventional complex-reward baseline, our formulation achieves comparable terrain traversal while reducing cost of transport by 56% and operational-limit violations by 96%. The resulting policies transfer zero-shot to a physical Unitree Go2 using LiDAR-based elevation mapping. Project website with videos: https://tinyurl.com/locomposition.
comment: 17 pages, 5 figures, 10 tables
FlashNav: Ultra-Fast Policy Training for Robot Navigation within 20 Seconds
Deep reinforcement learning has shown strong potential for robot navigation, but its practical deployment is still limited by the long wall-clock cost of policy training. This paper presents FlashNav, a GPU-first framework for ultra-fast range-based robot navigation training. To the best of our knowledge, FlashNav is the first DRL-based robot navigation framework that reaches seconds-level policy training, with the fastest deployable policy trained in less than 20 seconds. The key idea is to align simulation with the navigation MDP: FlashNav preserves the essential components for velocity-level navigation, including occupancy geometry, range sensing, goal-conditioned control, robot motion dynamics, collision handling, termination, and reset, while removing unnecessary rendering and high-fidelity physical details from the training loop. Built on a batched bitmap simulator and a fully GPU-resident training pipeline with our FastDSAC learner, FlashNav generates massive parallel navigation transitions entirely on GPU. Experiments on TurtleBot2 and Unitree Go2 show that FlashNav achieves a 100\% success-rate below 20 seconds on an RTX 5090 and remains within tens of seconds across desktop GPUs. The learned policies further transfer to physical wheeled and legged robots in static and dynamic indoor scenes, demonstrating that DRL-based navigation can be trained at seconds-level speed while preserving deployable obstacle-avoidance behavior.
comment: 15 pages, 4 figures
LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies
Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models
Vision-Language-Action models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. While the underlying large language models often possess multilingual capabilities, it remains unclear whether these multilingual capabilities transfer to VLAs during training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by extending existing benchmarks with translations of their instructions. Using these instructions, we evaluate several representative VLA models across a range of tasks in simulation settings. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions exhibit substantial performance degradation when evaluated on other languages, even when the underlying language backbone is multilingual. We provide several findings and analyses to understand the multilingual gap. Cross-lingual transfer behavior analysis shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that multilingual instruction-caused representation shifts may contribute to the multilingual gap. Motivated by these findings, we further explore strategies to improve multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment, which leverages Principal Component Analysis to get the principal component subspace and align projected multilingual representations, effectively reducing the multilingual performance gap.
Can Causal Models Enhance Robot Navigation? Online Causal Adaptation for Real-Robot Navigation
Causality in robotics aims to produce more interpretable and flexible robot behaviours by enabling robots to predict the consequences of their actions; however, deploying causal models with existing systems (e.g., navigation) operating in real environments remains understudied. This paper addresses the challenging problem of transferring causal models in real-robot experiments for a navigation scenario. We study this problem in two ways: (i) using the causal model as an offline evaluation module that predicts the competence of recorded real-robot navigation trajectories and relates it to quantitative navigation performance, and (ii) using the causal model as an online adaptation module that intervenes when the predicted competence of the default navigation is low. We validate our approach in a physical service robot that patrols around corridors. We show that the predicted competence correlates positively with path efficiency, and negatively with path irregularities (suboptimal behaviour). The model predictions also show strong agreement with human annotations (Cohen's kappa value of 0.88). In online experiments, the proposed method improves navigation performance in complex scenarios such as cornering and obstacle avoidance, yielding higher predicted competence and better navigation metrics than the default navigation baseline. In simpler scenarios, where the baseline already performs near-optimally, the causal adaptation provides limited benefit. These results indicate that causal models are particularly effective in enhancing navigation under increased task complexity. Overall, our results demonstrate that causal models developed for behavioural interpretation can be successfully integrated into real-robot navigation systems.
comment: Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
Learning New Tasks via Reusable Skills: Skill-Compositional Experts for Embodied Continual Learning
Embodied Continual Learning (ECL) aims to enable robots to continually acquire new manipulation tasks while retaining previously learned behaviors under closed-loop control. Compared with conventional continual learning, ECL suffers from more severe catastrophic forgetting. Feature drift accumulated under closed-loop control progressively propagates through sequential decision-making, leading to degradation of previously learned behaviors. A key challenge in ECL lies in structured skill reuse across continually evolving tasks, since existing methods primarily focus on skill learning without explicitly organizing them for coherent task execution. To address this issue, we propose SCE, a Skill-Compositional Experts framework for ECL. SCE builds a skill base via Compositional Skill Grounding (CSG), which decomposes task demonstrations into reusable skills. Based on this, Dual Execution-and-Transition Experts (DETE) enable new task learning through skill composition, where one branch ensures skill execution and the other supports transitions between skills for coherent behavior. Experiments on LIBERO benchmarks and real-world manipulation tasks demonstrate that SCE consistently improves retention and overall task performance. Further feature drift analyses and ablation studies verify the effectiveness of our method. Project website: https://eqcy.github.io/sce/.
comment: 13 pages, 5 figures
PO-PDDL: Learning Symbolic POMDPs from Visual Demonstrations for Robot Planning Under Uncertainty
Real-world robot task planning must operate under both stochastic action execution and partial observability, yet constructing Partially Observable Markov Decision Process (POMDP) models for real robotics domains remains difficult and labor-intensive. We introduce PO-PDDL, a symbolic formulation of POMDPs that preserves the relational structure and LLM-friendly syntax of the Planning Domain Definition Language (PDDL), while explicitly modeling partial observability, stochasticity, and beliefs. Building on this formulation, we propose a demonstration-driven pipeline for learning PO-PDDL models. The proposed method reconstructs latent symbolic state trajectories from real-robot execution videos, identifies partial observability via inconsistencies between inferred states and visual observations, and learns stochastic transition and observation models accordingly. The resulting PO-PDDL domains are reusable across tasks and enable online belief-space planning under both perception and execution uncertainty. Experiments on real-world long-horizon manipulation tasks show that our method consistently outperforms existing PDDL and POMDP model-learning approaches, achieving robust task planning under uncertainty with significantly lower planning cost.
TO-SoFiT: Topology Optimization of Hydraulic Soft Fish Tail Design for programmable undulating locomotion
Soft robots leverage compliant materials to generate motion through controlled elastic deformation, making them ideal for delicate tasks such as underwater exploration and biomimetic marine systems. Although hydraulic/pneumatic actuation remains pivotal for such systems, the lack of systematic design frameworks has hindered the development of robots capable of complex 3D motion, such as fish-like swimming. This work introduces a topology optimization method to automate the design of a hydraulic soft fish tail, explicitly addressing the design-dependent coupling between fluidic actuation and structural deformation. We use a Darcy law-based model augmented with a drainage term to simulate spatially varying hydraulic pressure loads, translating these into consistent nodal forces via finite element analysis. The employed robust multi-criteria optimization formulation balances deformation efficiency, fluid-structure interaction, geometric manufacturability, and required stiffness for optimizing a bioinspired soft fish tail for 3D swimming kinematics. The optimized tail topology is incorporated into a pneumatic network actuator and computationally validated under various hydraulic loads, achieving tunable undulatory amplitudes and multiaxis bending for depth adjustment. The optimized 2D tail outperforms its rectangular counterpart. By cascading optimized tail segments, we demonstrate programmable swimming patterns in soft robotic fish tails at different hydraulic loads. This work advances the systematic codesign of hydraulic actuators and soft structures, offering a pathway to automate underwater robots with optimized design and vertebrate-like agility in confined aquatic environments. Our implementations and simulations are publicly available at 'https://github.com/PrabhatIn/TO-SoFiT'.
comment: Accepted for publication at the Advances in Robotics (AIR), 2025, IIT Jodhpur
Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time
Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.
comment: https://recap-robot.github.io/
Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC
We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.
Perfect Demo Makes Poor Teacher: Learning Robust Alignment from Critical Motion Segments
Expert demonstrations are widely assumed to be the gold standard for robot imitation learning. Yet for fine-grained manipulation such as insertion, stacking, and alignment, we uncover a counterintuitive failure mode: fluent demonstrations can be poor teachers. A skilled teleoperator compresses the decisive moments of alignment and recovery into a brief temporal window, leaving the policy flooded with redundant free-space motion and starved of supervision exactly where precision determines success. We address this bottleneck at two levels. At the data level, slowing down near alignment and resampling critical segments both help, yet the gain comes mainly from broadening the coverage of recovery states the policy must learn, not from reweighting frames it already has. Such data-side fixes, however, leave the policy's per-frame view untouched: a single image still maps directly to an action, and the local motion that governs correction stays implicit. We therefore turn to the representation level and introduce STAIR (\textbf{S}patio-\textbf{T}emporal feature \textbf{A}s an \textbf{I}nterface for \textbf{R}obot learning), a compact dynamic feature that bridges the vision-language model and the action expert, distilling the short-horizon motion already recorded in each trajectory into dense, motion-aware supervision. Trained on fluent data alone, STAIR recovers most of the deliberate-demonstration gain ($50.0$ to $62.2\%$ overall, approaching the $64.4\%$ of deliberate demonstrations). These results call for a more pedagogical view of robot data, optimized for machine learnability rather than human efficiency alone.
SAPS: Shared Autonomy for Policy Steering by Blending Teleoperation with a Pretrained VLA
Recent advancements in Vision-Language-Action (VLA) models have demonstrated impressive generalist capabilities in robot manipulation, yet these policies can be brittle under out-of-distribution spatial and semantic perturbations. While human teleoperation offers reliable recovery, it can demand high cognitive load and precise manual control, and existing policy steering methods often require auxiliary models or sampler modifications. In this work, we introduce Shared Autonomy for Policy Steering (SAPS), a framework that blends real-time human teleoperation commands with pretrained policy actions at the action level. SAPS requires no policy retraining, auxiliary dynamics models, or architectural modifications. We propose and evaluate three arbitration strategies to balance human and VLA policy control, including a dynamic Cosine-similarity arbitration strategy that computes the geometric agreement between human and policy actions. Across evaluations in simulation (LIBERO, LIBERO-PRO, CALVIN) and on real-world robot hardware, SAPS improves task success rates over autonomous execution by up to 82% in both simulation and the real world. Furthermore, our approach drastically reduces human intervention compared to pure teleoperation, while simultaneously achieving faster task completion times than both autonomous execution and pure teleoperation. These results demonstrate that action-level shared autonomy is a practical, model-agnostic approach for reliably deploying generalist robot policies in real-world contexts involving a human operator,with promising applications in assistive teleoperation and scalable data collection.
comment: 23 pages, 15 figures, 5 tables
Robots as Tokens: Unified Diffusion Transformer for Coordinated Multi-Robot Trajectory Generation
The success of generative models in language and visual generation has inspired extensive applications to generative robot planning. However, most existing works either focus on single-robot planning, or generate multi-robot trajectories in a sequential manner with iterative post-processing to resolve inter-robot conflicts. In this work, we investigate whether coordinated multi-robot trajectories, as a special spatiotemporal distribution, can be learned and generated with a generative model in a feed-forward manner. We propose Robots as Tokens (Roken), a unified diffusion transformer that directly generates multi-robot trajectories that satisfy both (individual) safety and (global) connectivity constraints. The core design of Roken is to represent each robot as a discrete token, allowing them to naturally interact with each other through self-attention, and cross-attend to map tokens for environment layouts. We further introduce several auxiliary tasks based on Bayes' theorem to provide multi-scale spatial-temporal supervision for efficient learning of the conditional distribution. In training, Roken absorbs diverse expert trajectories from different team sizes. During inference, Roken behaves as a versatile multi-robot planner that can handle single-robot planning, coordinated multi-robot trajectory generation, and conditional trajectory generation by fixing some robot tokens as conditions. Experiments in diverse cluttered environments show that Roken can generate coordinated multi-robot trajectories to perform connectivity-constrained goal navigation tasks with high success rates, outperforming the baseline method used to generate the training dataset. Roken also demonstrates good scalability after training with mixed team sizes, and shows generalization to unseen or partially observed environments, verifying its potential to learn from diverse data and perform versatile tasks.
comment: 23 pages, 13 figures; \textbf{Project page:} \href{https://bairuofei.github.io/roken-project-page/}{\texttt{bairuofei.github.io/roken-project-page}}
NIMO: A Software Platform for Closed-Loop Materials Exploration with Diverse AI Algorithms
Self-driving laboratories (SDLs), where artificial intelligence proposes subsequent experiments and robotic systems execute them, are rapidly becoming the vanguard of materials discovery. A critical bottleneck, however, lies in seamlessly bridging diverse AI algorithms tailored for specific exploration goals with the heterogeneous robotic hardware found across different laboratories. Here, we present NIMO, an open-source software platform designed to dissolve this barrier through three core paradigms: a modular AI-robot decoupling mediated via simple CSV file exchange, a discrete candidate-pool architecture that seamlessly absorbs domain knowledge, and a unified Python interface pre-loaded with twelve distinct AI algorithms. In this Perspective, we review the operational principles of each algorithm alongside six diverse SDL implementations driven by NIMO, covering electrolyte discovery, organic synthesis, thin-film exploration, fuel-cell process informatics, coffee-ring phase exploration, and legacy liquid-handling automation. One of these also demonstrates NIMO's seamless interoperability with the IvoryOS orchestration framework. To democratize autonomous science, we also introduce a no-code desktop application that enables intuitive, human-in-the-loop exploration for non-programmers. NIMO is freely available at https://github.com/NIMS-DA/nimo, offering a versatile, plug-and-play foundation to accelerate autonomous materials exploration across diverse experimental landscapes.
comment: 29 pages, 5 figures
Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands
Dexterous grasping depends on contact regulation, not motion alone. Stable manipulation requires fingers to maintain appropriate object loading as contacts slip, deform, or become visually occluded. Existing cross-embodiment dexterous policies unify motion through retargeted hand poses or latent actions, but force feedback remains tied to each hand's sensing and actuation, limiting transfer. This work introduces a cross-embodiment force-position interface for contact-aware manipulation across heterogeneous dexterous hands. Motion intent is represented in a shared hand-pose latent, while each hand's effort signal is calibrated through system identification into physical joint torque in N.m. These torques are mapped to fingertip forces and compact per-finger load descriptors, giving the policy comparable observations of where the hand should move and how the object is loaded. Using this interface, a flow-matching visuomotor policy is trained on vision, proprioception, and calibrated contact, with structured visual masking that encourages reliance on force under grasp-relevant occlusion. The same calibrated signal drives a hybrid force-position controller for demonstration collection and execution, keeping force targets consistent across training and deployment. Experiments across structurally different hands show that calibrated contact feedback enables transferable compliant grasping, with learned primitives reusable in long-horizon manipulation pipelines.
comment: Website(overview): transferring-contact-not-just-motion.github.io
RSPECT: Robust and Scalable Planner for Energy-Aware Coordination of UAV-UGV Teams in Aerial Monitoring
We consider the robust planning of energy-constrained unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), which act as mobile charging stations, to perform long-horizon aerial monitoring missions. More specifically, given a set of points to be visited by the UAVs and desired final positions of the UAV-UGV teams, the objective is to find a robust plan (the vehicle trajectories) that can be realized without a major revision in the face of uncertainty (e.g., unknown obstacles/terrain, wind) to complete this mission in minimum time. We provide a formal description of this problem as a mixed-integer program (MIP), which is NP-hard. Since exact solution methods are computationally intractable for such problems, we propose RSPECT, a scalable and efficient heuristic. We provide theoretical results on the complexity of our algorithm and the feasibility and robustness of resulting plans. We also demonstrate the performance of our method via simulations and experiments.
comment: Accepted to the Journal of Intelligent & Robotic Systems (JINT)
Systematic Evaluation of Novel View Synthesis for Video Place Recognition IROS 2026
The generation of synthetic novel views has the potential to positively impact robot navigation in several ways. In image-based navigation, a novel overhead view generated from a scene taken by a ground robot could be used to guide an aerial robot to that location. In Video Place Recognition (VPR), novel views of ground locations from the air can be added that enable a UAV to identify places seen by the ground robot, and similarly, overhead views can be used to generate novel ground views. This paper presents a systematic evaluation of synthetic novel views in VPR using five public VPR image databases and seven typical image similarity methods. We show that for small synthetic additions, novel views improve VPR recognition statistics. We find that for larger additions, the magnitude of viewpoint change is less important than the number of views added and the type of imagery in the dataset.
comment: Submitted to IEEE IROS 2026
Whole-Brain Connectomic Graph Model Enables Whole-Body Locomotion Control in Fruit Fly
Animals perform coordinated whole-body movements under the control of neural systems shaped by brain-wide connectivity. The mapping of the whole-brain neural connections, or the connectomes, provides a natural graph for modeling sensorimotor information flow, yet its potential as a neural controller for embodied agents remains largely unexplored. Here, we introduce the Fly-connectomic Graph Model, which directly instantiates the whole-brain connectome of an adult Drosophila as a graph-structured neural controller for movements of a simulated biomechanical fruit fly via deep reinforcement learning. We achieve stable performance across diverse locomotion tasks, as well as better sample efficiency compared to both graph and non-graph baselines. Our results demonstrate a biologically informed way towards effective control policy design by translating whole-brain wiring principles into actionable architectural priors, while also improving the interpretability through dynamic information flow. This work also highlights the potential to bridge neuromechanics with embodied intelligence by providing a computational platform for investigating the sensorimotor transformation underlying animal behavior and a paradigm to advance the development of more nature-aligned intelligent systems.
IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% -> 97.1). Code and visualizations are available at: jongwoopark7978.github.io/IVRA
CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos
Generalist Vision-Language-Action models remain constrained by the scarcity of robotic data relative to the abundance of human video demonstrations. Existing Latent Action Models attempt to use video data but often suffer from visual entanglement, encoding noise rather than manipulation skills. To address this limitation, we propose Contrastive Latent Action Pretraining (CLAP), a framework that first uses Act-VAE to learn an executable action-token vocabulary from robot trajectories and then aligns human visual transitions with this vocabulary through contrastive learning. This alignment maps unlabeled human videos into a physically grounded latent action space rather than reconstructing appearance. Building on the aligned tokens, we train CLAP-NTP as an autoregressive VLA using robot demonstrations and pseudo-labeled human videos, preserving instruction following and object generalization. For deployment and target-domain adaptation, we further introduce a post-training strategy that combines CLAP-RF, a Rectified Flow action head for low-latency continuous action chunk prediction, with Knowledge Matching regularization to preserve pretrained semantic knowledge during fine-tuning. Extensive experiments show that CLAP achieves strong performance against competitive baselines while enabling effective skill transfer from human videos to robotic execution.
comment: The code is available at: https://github.com/LinShan-Bin/OpenCLAP
C-3TO: Continuous 3D Trajectory Optimization on Neural Euclidean Signed Distance Fields
This paper introduces a novel framework for continuous 3D trajectory optimization in cluttered environments, leveraging online neural Euclidean Signed Distance Fields (ESDFs). Unlike prior approaches that rely on discretized ESDF grids with interpolation, our method directly optimizes smooth trajectories represented by fifth-order polynomials over a continuous neural ESDF, ensuring precise gradient information throughout the entire trajectory. The framework integrates a two-stage nonlinear optimization pipeline that balances efficiency, safety and smoothness. Experimental results demonstrate that C-3TO produces collision-aware and dynamically feasible trajectories. Moreover, its flexibility in defining local window sizes and optimization parameters enables straightforward adaptation to diverse user's needs without compromising performance. By combining continuous trajectory parameterization with a continuously updated neural ESDF, C-3TO establishes a robust and generalizable foundation for safe and efficient local replanning in aerial robotics.
comment: 8 pages, 5 figures, submitted and accepted in ICUAS 2026
Bio-inspired decision making in robot swarms under biases
Minimalistic robot swarms offer a scalable, robust, and cost-effective approach to performing complex tasks with the potential to transform applications in healthcare, disaster response, and environmental monitoring. However, coordinating such decentralised systems remains a fundamental challenge, particularly when robots are constrained in communication, computation, and memory. In our study, individual robots frequently make errors when sensing the environment, yet the swarm can rapidly and reliably reach consensus on the best among $n$ discrete options. We compare two canonical mechanisms of opinion dynamics -- direct-switch and cross-inhibition -- which are simple yet effective rules for collective information processing observed in biological systems across scales, from neural populations to insect colonies. We generalise the existing mean-field models by considering asocial biases influencing the opinion dynamics. While swarms using direct-switch reliably select the best option in absence of asocial dynamics, their performance deteriorates once such biases are introduced, often resulting in decision deadlocks. In contrast, bio-inspired cross-inhibition enables faster, more cohesive, accurate, robust, and scalable decisions across a wide range of biased conditions. Our findings provide theoretical and practical insights into the coordination of minimal swarms and offer insights that extend to a broad class of decentralised decision-making systems in biology and engineering.
AetheRock: An Arm-Worn Robot Teaching System for Force-Guided Vision-Tactile Learning
Force and tactile sensing are indispensable in contact-rich manipulation. However, force-aware robot learning faces critical challenges due to the incompatible assembly of tactile and force sensors in handheld or wearable devices. To address these limitations, we first introduce AetheRock for gripper-force, vision, and tactile data collection, which is an arm-worn device featuring a modular and easily manufactured visuo-tactile sensor, GelSlim-MiniFab, at the fingertip, a resistive pressure sensor at the human finger contact region, a customized PCB module, and a wearable kit for comfortable and robust collection. Building on this, we propose ForceVT, a representation learning framework that uses force and vision to guide fidelity-agnostic tactile learning, enabling robust inference in any tactile situation. Real-world experiments show that AetheRock achieves qualified data efficiency and that ForceVT effectively alleviates inefficiencies when visuo-tactile sensors exhibit manufacturing and utilization inconsistencies. Overall, our work mitigates the limitations of gripper-force vision-tactile robot learning through innovative hardware design and algorithms.
FDIO: Frequency Decomposed Inertial Odometry
Pedestrian inertial odometry (PIO) estimates autonomous pedestrian motion using only acceleration and angular velocity measurements collected by an inertial measurement unit (IMU), making it highly valuable for consumer level localization applications. However, under a dual device acquisition setting, IMU signals collected by a freely carried mobile device are inherently composite signals in which the global motion of the human torso is coupled with perturbations induced by local limb motion. This coupling makes accurate human motion modeling more challenging. To address this issue, this paper proposes frequency decomposed inertial odometry (FDIO). The proposed method first decomposes input IMU signals into low frequency and high frequency components using a Laplacian pyramid. It then adopts a Mamba module to model long range motion information from the low frequency component and uses a multi scale convolution module to extract fine grained local dynamic features from the high frequency component. Experiments on five public PIO datasets show that FDIO achieves an average absolute trajectory error of 3.221~m and an average relative trajectory error of 2.550~m, reducing the errors by 33.3\% and 16.7\% compared with the RoNIN ResNet baseline, respectively. These results validate the effectiveness of the proposed frequency decomposition strategy. To the best of our knowledge, this work is among the first efforts to introduce Mamba and a frequency decomposition architecture into inertial odometry.
DynNPC: Finding More Violations Induced by ADS in Simulation Testing through Dynamic NPC Behavior Generation
Recently, a number of simulation testing approaches have been proposed to generate diverse driving scenarios for autonomous driving systems (ADSs) testing. However, the behaviors of NPC vehicles in these scenarios generated by previous approaches are predefined and mutated before simulation execution, ignoring traffic signals and the behaviors of the Ego vehicle. Thus, a large number of the violations they found are induced by unrealistic behaviors of NPC vehicles, revealing no bugs of ADSs. Besides, the vast scenario search space of NPC behaviors during the iterative mutations limits the efficiency of previous approaches. To address these limitations, we propose a novel scenario-based testing framework, DynNPC, to generate more violation scenarios induced by the ADS. Specifically, DynNPC allows NPC vehicles to dynamically generate behaviors using different driving strategies during simulation execution based on traffic signals and the real-time behavior of the Ego vehicle. We compare DynNPC with state-of-the-art scenario-based testing approaches. Our evaluation has demonstrated the effectiveness and efficiency of DynNPC in finding more violation scenarios induced by the ADS.
comment: Accepted by TOSEM 2026
One-Step Model Predictive Path Integral for Manipulator Motion Planning Using Configuration Space Distance Fields
Motion planning for robotic manipulators is a fundamental problem in robotics. Classical optimization-based methods typically rely on the gradients of signed distance fields (SDFs) to impose collision-avoidance constraints. However, these methods are susceptible to local minima and may fail when the SDF gradients vanish. Recently, Configuration Space Distance Fields (CDFs) have been introduced, which directly model distances in the robot's configuration space. Unlike workspace SDFs, CDFs are differentiable almost everywhere and thus provide reliable gradient information. On the other hand, gradient-free approaches such as Model Predictive Path Integral (MPPI) control leverage long-horizon rollouts to achieve collision avoidance. While effective, these methods are computationally expensive due to the large number of trajectory samples, repeated collision checks, and the difficulty of designing cost functions with heterogeneous physical units. In this paper, we propose a framework that integrates CDFs with MPPI to enable direct navigation in the robot's configuration space. Leveraging CDF gradients, we unify the MPPI cost in joint-space and reduce the horizon to one step, substantially cutting computation while preserving collision avoidance in practice. We demonstrate that our approach achieves nearly 100% success rates in 2D environments and consistently high success rates in challenging 7-DOF Franka manipulator simulations with complex obstacles. Furthermore, our method attains control frequencies exceeding 750 Hz, substantially outperforming both optimization-based and standard MPPI baselines. These results highlight the effectiveness and efficiency of the proposed CDF-MPPI framework for high-dimensional motion planning.
FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies
Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding. We introduce FineVLA, an open framework for action-aligned fine-grained VLA supervision. The framework includes: (1) a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets and builds FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; (2) a held-out benchmark with 500 videos, 11,631 atomic facts, and 1,030 VQA questions; (3) a robotics-specialized VLM annotator for scalable fine-grained annotation; and (4) a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions. Our experiments yield three findings. First, fine-grained supervision does not sacrifice goal-level success: FG-only improves over Raw-only by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary, following a consistent inverted-U trend peaking at FG:Raw = 1:2 to 1:1. The best mixed setting reaches 86.8%/82.5% in RoboTwin simulation and 62.7/100 in real-world dual-arm manipulation (vs. 49.9 Raw-only). Third, fine-grained supervision improves steerable control: the largest real-world gains appear on pose (+23), color (+18), and approach direction (+18)--factors where goal-level instructions provide no guidance. Overall, fine-grained language should augment goal-level instructions: specifying how to execute alongside what to achieve. Project page: https://finevla.xlang.ai/
comment: 26 pages, 7 figures, 25 tables
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention CVPR 2026
Vision-Language-Action (VLA) models have shown remarkable progress in embodied tasks recently, but most methods process visual observations independently at each timestep. This history-agnostic design treats robot manipulation as a Markov Decision Process, even though real-world robotic control is inherently partially observable and requires reasoning over past interactions. To address this mismatch, we reformulate VLA policy learning from a Partially Observable Markov Decision Process perspective and propose AVA-VLA, a framework that conditions action generation on a recurrent state that serves as a neural approximation to the agent's belief over task history. Built on this recurrent state, we introduce Active Visual Attention (AVA), which dynamically reweights visual tokens in the current observation to focus on regions most relevant given both the instruction and execution history. Extensive experiments show that AVA-VLA achieves state-of-the-art performance on standard robotic benchmarks, including LIBERO and CALVIN, and transfers effectively to real-world dual-arm manipulation tasks. These results demonstrate the effectiveness of temporally grounded active visual processing for improving VLA performance in robotic sequential decision-making. The project page is available at https://liauto-dsr.github.io/AVA-VLA-Page.
comment: Accepted at CVPR 2026 (Highlight)
Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models
Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%, demonstrating that safer social navigation can be achieved by teaching VLA policies to act on representations they already possess. These results show that pretrained VLA policies can be adapted for safer social navigation by better aligning their latent representations with action generation.
Explainable deep learning improves human mental models of self-driving cars
Self-driving cars increasingly rely on deep neural networks to achieve human-like driving. The opacity of such black-box planners makes it challenging to accurately anticipate when they will fail, with potentially catastrophic consequences. While research into interpreting these systems has surged, most of it is confined to simulations or toy setups due to the difficulty of real-world deployment, leaving the practical utility of such techniques unknown. Here, we introduce the Concept-Wrapper Network (CW-Net), a method for faithfully explaining the behavior of machine-learning-based planners that causally grounds their reasoning in human-interpretable concepts without sacrificing performance. We deploy CW-Net on a real self-driving car and show that the resulting explanations improve the human driver's mental model of the vehicle, allowing them to better predict its behavior, particularly in surprising situations. This demonstrates that explainable deep learning integrated into self-driving cars can be both understandable and useful in a realistic deployment setting. We anticipate our method could be applied to other safety-critical systems, such as autonomous drones and robotic surgeons, as well as to other architectures, such as end-to-end learning systems and vision-language-action models. Overall, our study establishes a deployment-validated pathway to interpretability for autonomous agents, which could help make them more transparent and safe.
comment: MST & JAS contributed equally to this work
Multiagent Systems
Orchestrated Reality: From Role-Play to Living, Playable Game Worlds -- LLM-Driven World Simulation as a Parameterized-Action POMDP
Many games rely on storytelling combined with systems that track levelling, NPC behaviour, and consequence simulation; bridging tightly-authored narrative with deeply-simulated worlds -- most acute in sandbox and open-world settings -- has been prohibitively expensive. LLM-driven worlds open a new path: a single harness can coordinate numerical state, narrative voice, storytelling pacing, and rule logic together. Realising this requires the LLM system to sustain a persistent world (who is where, what has just happened, what is currently true), which today's deployed systems do not: the narrative voice asserts state in free prose without any validated representation, so a fully autonomous game engine remains infeasible. We treat this as an architectural choice, not a limitation of language models, and report work in progress on a framework -- orchestrated reality -- that makes the world a canonical object owned by a singleton orchestration agent analogous to the tabletop-RPG Game Master (GM). We formalise an LLM-driven game world for a human player as a Parameterized-Action POMDP: state is a tree of canonical JSON entities, actions decompose as $a=(k, x_k)$ (a discrete intent kind plus structured JSON parameters), the agent observes only a narrative projection $o=O(s)$ of state, and the transition kernel $F$ is an LLM-driven Plan-Diff-Validate-Apply (PDVA) pipeline that commits schema-validated, content-hashed JSON deltas. We give the formal model, a JSON-state example, a worked single-turn example, and a catalogue of 15 illustrative incidents drawn from a real deployment showing the framework in action. Empirical validation through a planned human player study -- together with multi-NPC concurrent agency and deployment as an RL environment -- is situated as future work.
comment: 9 pages, 2 figures. Work in progress. Yuhang Huang and Chenmiao Li contributed equall
DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts
Historical medical archives and traditional medicines hold immense potential for drug discovery and remain a primary source for current drug development. However, pre-ontological prose and idiosyncratic taxonomies prevent the standardization and medical modernization of the data for use in current biomedical pipelines. Furthermore, no existing LLM agent system, whether tool-calling, retrieval-augmented, or agentic deep-research, can convert such text into verifiable drug-discovery leads at scale. We close this gap with DeepRoot, a multi-agent LLM system that jointly builds and utilizes a verified knowledge graph, showing that grounding and reasoning -- often conflated -- are separable axes the system can compose for therapeutic reasoning. Applied to the Shen Nong Ben Cao Jing, DeepRoot recovers $10$ of $21$ held-out compound-disease treatment pairs at R@$20$ ($47.6\%$ vs $4.8\%$ for a raw corpus LLM and $\sim\!2.4\%$ random) and dominates an LLM-as-judge audit for reasoning quality over baseline LLMs and LLMs with direct tool-call access to the same APIs DeepRoot itself queries. Tool-using LLMs hallucinate evidence on $87\%$ of claims, versus 7-10% for DeepRoot. Graph-only inference hallucinates $0\%$ but ranks lowest on reasoning coherence; DeepRoot KG+LLM is the only condition to win on both axes, pointing toward a route for systematic mining and repurposing of historical medical knowledge.
SkillVetBench: LLM-as-Judge for Multi-Dimensional Security Risk Evaluation in Open-Source LLM Agent Skills NeurIPS 2027
Open-source LLM agent ecosystems are growing rapidly, yet the security of community-contributed skills - modular tool definitions that extend agent capabilities - remains largely unvetted. The gap we fill: existing scanners operate at the code layer and are structurally blind to instruction-layer and multi-agent risk - natural-language directives that hijack an agent, exfiltrate data through encoded side channels, or chain harm across pipelines - so what is needed is a semantic, multi-dimensional vetting system rather than another signature matcher. We present SKILLVETBENCH, a live public leaderboard on Hugging Face that uses an LLM-as-Judge to vet agent skills. What is new: SARS (Skill Agentic Risk Score), a five-dimensional agentic-risk metric with a principled weighted formula for instruction-following systems. What is integrated: full CVSS v4.0 vector decomposition and a ClawHub dual-view that places our LLM-generated review beside the official marketplace verdict. What is demonstrated: drawing on our companion benchmark paper [ 1], the LLM-as-Judge stage achieves zero false negatives across 78 confirmed-malicious skills and zero false positives across 22 benign controls, while the best static baseline (SKILLSIEVE) still misses 15%; for instruction-layer categories such as Prompt Injection and Memory Poisoning, conventional tools miss between 89% and 100% of threats (e.g., CODEBERT detects none of nine memory-poisoning skills). Detection rates vary from 35% to 95% across four LLM evaluators, motivating ensemble scoring in production deployments.
comment: The main research paper is submitted to NeurIPS 2027, it is in under review
Odds Law: The Decomposition Algebra On How Intelligence Organizes Itself to Solve Difficult Problems Reliably
We ask a structural question: given unreliable elementary problem-solvers, what organizations of them solve hard problems reliably, and what are the limits? We develop a $decomposition~algebra$: elementary solvers are morphisms in a stochastic category, and four combinators (sequential composition, parallel ensembling, verification gating, and recursive reduction) generate the space of compound solvers. We equip this algebra with two homomorphisms, a $reliability$ valuation into the ordered monoid $([0,1],\le)$ and a $cost$ valuation into a commutative semiring, and we derive the composition laws that govern how reliability flows through structure. Our central results are (i) a $verification~odds~law$ (the result that names this report), showing that a verification gate multiplies the odds of correctness by the verifier's likelihood ratio $Λ$, so that $k$ conditionally independent gates yield geometric amplification; (ii) a $reliability~amplification~theorem$, giving target reliability $1-δ$ at $O(\log 1/δ)$ verification depth whenever $Λ>1$; and (iii) a $threshold~dichotomy$: above the critical parameters reliability can be driven arbitrarily close to one at logarithmic cost, while at or below them no amplification is possible. We then show that $self-organization$ is the least fixed point of a monotone improvement operator on the complete lattice of strategies, and that this fixed point equalizes marginal log-odds gain per unit cost. Finally, we prove matching limits: an information ceiling bounds per-gate amplification by a divergence quantity; shared error causes create a strictly positive voting floor, so diversity is $necessary$ for unbounded amplification. Reliability, in short, is neither free nor magical: it is bought with independent information, arranged by composition, and bounded by the verifier.
comment: 10 pages, 2 figures
AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan
Jordan faces severe water scarcity with 50\% of water produced is lost to leakage, theft and metering issues also known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework integrating EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, employing retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT with offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1~L/s simulated leak produces measurable flow redistribution in 15 pipes, flagging a 15-junction cluster that localises the burst -- confirming alignment with water distribution zone (DZ) monitoring practice. The framework accommodates Jordan's intermittent supply patterns and limited automation through phased implementation, offering a scalable pathway for water-scarce regions to leverage intelligent automation for NRW reduction and operational efficiency.
Agentic Retrieval and Reinforcement Learned Equation Chains: A Controlled Generation Framework for Complex and Novel Physics Word Problems
Generating high-quality Physics Word Problems (PWPs) that are novel, complex, and solvable remains a challenging and underexplored problem in educational content generation. Existing approaches, many adapted from Math Word Problem (MWP) generation, often produce ambiguous, unsolvable, or structurally simple questions with limited linguistic diversity. We introduce ARVRE (Agentic Retrieval Value Reinforced Equation-chain), a two-stage framework for generating diverse and mathematically valid PWPs. In the first stage, a form of offline temporal-difference learning is used to construct valid chains of physics equations, while an agentic retrieval-augmented generation (RAG) framework dynamically selects topic-specific concepts and vocabulary. This design enables explicit control over problem structure and difficulty. In the second stage, a Large Language Model (LLM) converts the equation chain and retrieved concepts into a natural-language physics question. By grounding generation in valid equation chains, our method preserves mathematical correctness while promoting linguistic diversity and contextual richness. Human and automated evaluations demonstrate that ARVRE generates PWPs that are more complex, novel, and solvable than those produced by existing approaches. These results highlight the potential of combining reinforcement learning, retrieval, and LLMs for reliable generation of educational physics content.
RSPECT: Robust and Scalable Planner for Energy-Aware Coordination of UAV-UGV Teams in Aerial Monitoring
We consider the robust planning of energy-constrained unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), which act as mobile charging stations, to perform long-horizon aerial monitoring missions. More specifically, given a set of points to be visited by the UAVs and desired final positions of the UAV-UGV teams, the objective is to find a robust plan (the vehicle trajectories) that can be realized without a major revision in the face of uncertainty (e.g., unknown obstacles/terrain, wind) to complete this mission in minimum time. We provide a formal description of this problem as a mixed-integer program (MIP), which is NP-hard. Since exact solution methods are computationally intractable for such problems, we propose RSPECT, a scalable and efficient heuristic. We provide theoretical results on the complexity of our algorithm and the feasibility and robustness of resulting plans. We also demonstrate the performance of our method via simulations and experiments.
comment: Accepted to the Journal of Intelligent & Robotic Systems (JINT)
Bio-inspired decision making in robot swarms under biases
Minimalistic robot swarms offer a scalable, robust, and cost-effective approach to performing complex tasks with the potential to transform applications in healthcare, disaster response, and environmental monitoring. However, coordinating such decentralised systems remains a fundamental challenge, particularly when robots are constrained in communication, computation, and memory. In our study, individual robots frequently make errors when sensing the environment, yet the swarm can rapidly and reliably reach consensus on the best among $n$ discrete options. We compare two canonical mechanisms of opinion dynamics -- direct-switch and cross-inhibition -- which are simple yet effective rules for collective information processing observed in biological systems across scales, from neural populations to insect colonies. We generalise the existing mean-field models by considering asocial biases influencing the opinion dynamics. While swarms using direct-switch reliably select the best option in absence of asocial dynamics, their performance deteriorates once such biases are introduced, often resulting in decision deadlocks. In contrast, bio-inspired cross-inhibition enables faster, more cohesive, accurate, robust, and scalable decisions across a wide range of biased conditions. Our findings provide theoretical and practical insights into the coordination of minimal swarms and offer insights that extend to a broad class of decentralised decision-making systems in biology and engineering.
Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension
Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): ten of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Gemini 3.1 Pro Refine achieves the highest ICD in the dataset (0.925), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is about 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.
comment: v2 (erratum): two truncated Gemini 3.1 Pro libraries regenerated; cooperative-plurality 9/12->10/12, conclusions unchanged. 11 pages, 3 figures, 8 tables. Extends arXiv:2501.16173. Code and n=500 replication: https://github.com/arqFranciscoLeon/evollm (archived: https://doi.org/10.5281/zenodo.20248615)
Truth, Justice, and Secrecy: Cake Cutting Under Privacy Constraints AAAI 2026
Cake-cutting algorithms, which aim to fairly allocate a continuous resource based on individual agent preferences, have seen significant progress over the past two decades. Much of the research has concentrated on fairness, with comparatively less attention given to other important aspects. Chen et al. (2010) introduced an algorithm that, in addition to ensuring fairness, was strategyproof -- meaning agents had no incentive to misreport their valuations. However, even in the absence of strategic incentives to misreport, agents may still hesitate to reveal their true preferences due to privacy concerns (e.g., when allocating advertising time between firms, revealing preferences could inadvertently expose planned marketing strategies or product launch timelines). In this work, we extend the strategyproof algorithm of Chen et al. by introducing a privacy-preserving dimension. To the best of our knowledge, we present the first private cake-cutting protocol, and, in addition, this protocol is also envy-free and strategyproof. Our approach replaces the algorithm's centralized computation with a novel adaptation of cryptographic techniques, enabling privacy without compromising fairness or strategyproofness. Thus, our protocol encourages agents to report their true preferences not only because they are not incentivized to lie, but also because they are protected from having their preferences exposed.
comment: This is the full version of our paper published in the Proceedings of AAAI 2026
A Simple Hierarchical Causality Primer
We provide a brief primer for the idea behind formalising hierarchical causality in the context of complex systems. Here actors are not simply agents. Actors instantiate causation classes. Agents implement local dynamics in given levels or organisation in a given system. Hierarchical causality then describes how actor-level roles constrain, select, and organise agent-level behaviour across levels. The system then necessarily requires three additional structures. First, causation classes to abstract a given form of causal influence that an actor instantiates. Second, aggregation operators to move across the levels. Third, discrete event-time maps are required because the system comprises events, and the relation between local event counts and any global clock must be specified. Our formulation here is purposefully simple and discrete.
comment: 15 pages, 1 figure; short technical primer with a toy example in an appendix, corrected minor typos, refined the admissible kernel notation, added a rough Appendix B with the start of an outline of an experimental design requirements specification based on feedback from colleagues and readers (thank you for the feedback and comments)
The Axiom of Consent: Friction Dynamics in Multi-Agent Coordination
Multi-agent systems must coordinate despite heterogeneous preferences, asymmetric stakes, and imperfect information. When coordination fails, friction emerges: measurable resistance such as deadlock, thrashing, or conflict. We derive a formal framework for coordination friction from a single axiom: actions affecting agents require their authorization in proportion to stakes. From this axiom of consent we establish the kernel triple $(α, σ, \varepsilon)$ -- alignment, stake, and entropy -- as sufficient statistics for any resource-allocation configuration, and propose a friction functional whose simplest candidate form $F = σ(1+\varepsilon)/(1+α)$ predicts that friction rises with stakes and entropy and falls with alignment. We stress that this form is a phenomenological ansatz, not a theorem -- the simplest expression satisfying our desiderata -- whose empirical adequacy, in particular whether the alignment dependence is monotone, remains open. A companion study tests it in a multi-agent reinforcement-learning environment, finds the linear alignment dependence falsified by a U-shaped relationship, and motivates a quadratic form $F = σ(1+\varepsilon)/(1+α^2)$ that we characterize axiomatically as a refinement for future confirmation. The Replicator-Optimization Mechanism governs selection over coordination strategies: lower-friction configurations persist longer, making consent-respecting arrangements dynamical attractors rather than normative ideals. We give formal definitions for resource consent, coordination legitimacy, and friction-aware allocation, a measurement apparatus, and machine-checked Lean 4 proofs of the core comparative-statics. Illustrative applications to cryptocurrency governance and political legitimacy show one architecture spanning domains, offered as candidate unification, not established identity.
comment: 89 pages, 3 appendices
Learning to Share: Selective Memory for Efficient Parallel Agentic Systems ICML 2026
Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar sub-problems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks that enables selective cross-team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage-aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems. Project page: https://joefioresi718.github.io/LTS_webpage/
comment: ICML 2026
Systems and Control (EESS)
Anisotropic Template Ansätze for Robust Positive Invariance under State-Dependent Uncertainty
We establish sufficient conditions for robust positive invariance under state- and input-dependent disturbances with anisotropic covariance structure. The proposed ansatz maps a fixed ellipsoidal template through a GP-derived positive-definite matrix field, subsuming scalar homothetic scaling while retaining finite graph-based verification. The resulting LMI conditions couple the learned field to Schur-stable dynamics; an isotropic fallback with inflation factor $r=1/(1-γ_{\mathrm{cl}})$ proves admissibility. During each learning epoch the field is frozen, so online tube evaluation is one GP covariance query and a small matrix square root, with no online set iteration or LMI solve. Quadrotor simulations show a $195\times$ reduction in 3D velocity-tube volume and a $2.1{\times}10^5$ reduction in the joint 7D velocity-control subspace relative to a non-adaptive homothetic baseline. This extended version adds full proofs, a separated offline/online complexity analysis, and controller-sweep, contraction, and projection-area studies.
A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation CEC
Reliable state estimation in robotics and control re quires balancing estimation accuracy against computational cost. While filtering-based methods such as the Extended Kalman Filter (EKF) provide efficient real-time updates, and optimisation based formulations using factor graphs improve global consistency, the role of optimisation scheduling is often treated implicitly rather than examined as an explicit design variable. This paper presents an experimental study that explicitly isolates optimisation scheduling using a Smart Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. By combining EKF-based state propagation with periodically invoked batch optimisation and holding solver structure and effort fixed, the main contribution of this work is the experimental characterisation of optimisation scheduling as an independent design variable governing the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly influences pre optimisation drift, transient error behaviour, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimisation can be retained at a fraction of the computational cost, highlighting optimisation scheduling as an under-explored yet critical consideration in hybrid state estimation systems.
comment: This work has been accepted for presentation/publication at the 2026 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). The final published version will appear in IEEE Xplore
SparseCol: A 1320 BTOPS/W Precision-scalable NPU Exploiting Training-free Structured Bit-level Sparsity and Dynamic Dataflow SC
Bit-serial computation enables sequential processing of data at the bit level, providing several advantages, such as scalable computational precision. This approach has gained significant attention, especially for exploiting bit-level sparsity in AI workloads. While current bit-serial processors leverage bit-level sparsity to eliminate the computation associated with zero bits, they face a fundamental trade-off: either they suffer from low memory-access and computation efficiency caused by irregular patterns of non-zero bits, or they incur substantial area overhead from complex online scheduling mechanisms required to reorganize bit-level data and preserve memory access and computation regularity. Therefore, we present the SparseCol processor, designed to harness extensive bit sparsity while maintaining high hardware utilization across various AI applications, including CNNs, RNNs, and transformers. In contrast to traditional methods, SparseCol exploits structured bit-level sparsity, denoted by bit-column sparsity, without requiring any re-training. Furthermore, SparseCol implements a dynamic dataflow architecture that tackles hardware under-utilization issues commonly found in existing bit-serial solutions. Fabricated in 16nm CMOS node, SparseCol delivers 1320 BTOPS/W (BTOPS represents Binary Tera-Operations Per Second, calculated as #W bits x #A bits TOPS) peak efficiency while maintaining accuracy, outperforming SotA sparse processors in terms of efficiency by 6.8x. Comprehensive evaluations on CNN classification tasks and transformer architectures demonstrate system-level efficiencies of 745.02 BTOPS/W and 850.5 BTOPS/W, respectively.
comment: 14 pages, 18 figues, IEEE: Journal of Solid-State Circuits (JSSC)
Friction Characterization of a Cable-Driven Differential Actuation System for Lower-Limb Exoskeletons
Lower-limb exoskeletons require actuation systems that can provide accurate joint torque control while preserving low mass and encumbrance. Conventional architectures often rely on independently actuated joints and joint-level torque sensors, increasing system complexity and weight. This paper presents a novel differential actuation architecture for hip-knee flexion/extension, enabling cooperative torque sharing between two motors via a linear differential mapping between motor and joint. To compensate for transmission losses, a model-based friction estimation strategy is developed and experimentally implemented, allowing accurate joint torque estimation without the need for torque sensors. The proposed solution is validated on a physical prototype, demonstrating the feasibility of sensorless torque estimation in a differentially actuated hip-knee module of a lower-limb exoskeleton.
comment: Accepted for presentation IEEE RAS/EMBS 11th International Conference on Biomedical Robotics and Biomechatronics
SINR-Aware Base Station Deployment in Wide Area IoT Sensor Networks
The rapid expansion of Internet of Things (IoT) applications necessitates the effective deployment of base stations (BSs) to enable consistent connectivity across large geographic areas under interference-limited conditions. Existing techniques typically use distance-based or binary coverage models; however, these abstractions fail to account for the influence of co-channel interference on the quality of communication in dense deployments. In this paper, we investigate the Signal-to-Interference-plus-Noise Ratio (SINR)-aware Base Station Deployment (BSD) problem in wide-area IoT sensor networks. The objective is to determine a minimum-cost subset of BSs from a predefined set of candidate BSs such that every IoT sensor is covered by at least one BS and a target SINR threshold is satisfied. The problem is formulated as a combinatorial optimization problem, which is NP-hard. Theoretical analysis establishes that the proposed coverage function is monotone and submodular, enabling the SINR-aware greedy algorithm to achieve a (1-1/e)-approximation to the optimal solution while maintaining a polynomial-time computational complexity. Numerical evaluations on a real water distribution network dataset demonstrate that the proposed SINR-aware greedy algorithm achieves near-optimal base station deployment while significantly reducing computational effort. Compared with the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) algorithms, the proposed approach attains complete sensor coverage with deployment costs within 12.3% of the best-performing metaheuristic solution while requiring up to 190 times lower execution time.
comment: 6 pages, 4 figures. Submitted to a conference, under review
Artificial Intelligence for Power-Converter-Rich Electrical Systems: A Review
Power-converter-rich electrical systems, formed by renewable generation, electrified transportation, and inverter-based resources, exhibit strongly nonlinear dynamics, multi-physics design tradeoffs, fast control requirements, and growing reliability and cybersecurity constraints. These characteristics strain workflows that rely only on physics-based modeling, sequential optimization, and rule-based operation. This paper reviews artificial intelligence (AI) for power-converter-rich electrical systems through a life-cycle and deployment-readiness perspective. The literature is organized across converter design, real-time control, system-level operation, and compliance-oriented governance. For design, we examine surrogate modeling, topology and parameter synthesis, EMI/EMC-aware optimization, reliability-oriented design, and knowledge-assisted workflows. For control, we compare supervised learning, reinforcement learning, learning-augmented predictive control, and safety-constrained learning according to their role in closed-loop implementation. For operations, we focus on microgrid coordination, forecasting, distribution-system observability, privacy-preserving coordination, and cyber-resilient operation where converter-interfaced resources shape the operating problem. Across these stages, the review emphasizes deployment-critical gaps, including stability certification, constraint satisfaction, interpretability, extrapolation, data efficiency, sim-to-real transfer, embedded latency, cybersecurity, privacy, and standards alignment. The resulting taxonomy is intended to clarify where AI is already useful as an engineering support tool and where further validation is needed before autonomous or safety-critical deployment.
Stability Analysis in Multi-Constraint Safety Filters for Linear Systems
Multi-constraint safety filters based on control barrier functions for linear systems with affine state constraints yield continuous piecewise-affine closed-loop dynamics and may introduce boundary equilibria and unstable active-set modes. Although they guarantee forward invariance, they can change nominal stability, and it remains unclear when unstable modes cause divergence versus bounded, convergent behavior. This paper develops a geometric framework to separate these cases: leveraging explicit active-set realizations, we show that equilibria associated with nonempty active sets lie on the corresponding constraint faces and that any unstable directions are tangent to those faces due to exponential enforcement of the active constraints. We characterize mode stability via a minimum-phase test, certify divergence under fixed active sets using recession cones, and derive tractable linear-matrix-inequality conditions for global exponential stability or boundedness using Lyapunov and LaSalle arguments.
Fast Convergence and Robustness for Two-Layered Forgetting Recursive Least Square under Finite Excitation
Under nonpersistent excitation (non-PE) conditions, conventional methods such as exponential forgetting (EF) or directional forgetting (DF) recursive least squares (RLS) that rely on direct regressor vectors exhibit inherent limitations in terms of stability guarantees for parameter errors, robustness to system changes, and convergence rates. To address these limitations, this study introduces a novel two-layer forgetting RLS (TLF-RLS) identification method based on an augmented regressor matrix constructed using DF, which ensures global exponential stability and enhances robustness under non-PE condition. However, the convergence rate of the parameter is strongly dependent on the forgetting factor because of the introduction of EF in the outer layer, which causes an estimation windup under non-PE condition. To address this issue, a novel reconfiguration-based EF (ReEF) algorithm is proposed, which is achieved through variable- and matrix-based forgetting related to the magnitude of the eigenvalues of the current covariance matrix. Theoretical analysis indicates that TLF-RLS with ReEF algorithm guarantees uniform ultimate boundedness of the condition number under mild assumptions. Consequently, the proposed method resolves the trade-off between fast parameter convergence and robustness in both transient and steady-state responses under changes in system characteristics. Numerical simulations of three aforementioned cases demonstrate the effectiveness of the proposed method.
comment: 17 pages, 14 figures
An Optimal Power Management Policy for Hydrogen-based Hybrid Aero Engines
This paper presents a power management policy for a hydrogen-based hybrid aero engine combining a gas turbine and a solid oxide fuel cell (SOFC). Specifically, we first identify a quadratic quasi-steady-state model of the propulsion system and formulate the minimum-fuel optimal control problem as a function of the power split between gas turbine and SOFC that captures the interconnections between the components and accounts for their operational limits. Second, leveraging the Karush-Kuhn-Tucker optimality conditions and partial convexity and monotonicity model properties, we compute the globally optimal steady-state power split for the different phases of the flight in closed form. Finally, we verify this power management policy with a high-fidelity integrated static model %simulator across different flight phases, revealing in less than 1.5 % normalized root mean square error in power allocation and less than 0.7 % in predicted fuel consumption. Our results show that the optimal power management policy can be translated into a heuristic control law requesting the highest SOFC power that does not exceed its maximum operating temperature, ultimately paving the way for minimal-effort on-board implementations.
comment: V1-Replacement
AIChilles: Automatically Uncovering Hidden Weaknesses in AI-Evolved Systems
The computer systems community has recently seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks such as AdaEvolve and Engram report 12-60% score improvements over human-designed algorithms. While these results are promising, there are practical concerns if these AI-evolved programs can perform worse on unseen workloads and exhibit scalability regressions. Given the speed and scale of AI-generated code, we need automated mechanisms to uncover such identify hidden weaknesses in AI-evolved systems programs. To this end, we develop AIChilles that takes as input a baseline program $P$ and an AI-evolved program $P'$, AIChilles searches for valid workloads where $P'$ regresses relative to $P$ in correctness, runtime, memory usage, or output quality. To tackle the diversity in system applications, weakness types and potential bugs, AIChilles combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures. Across five system applications and 30 AI-evolved programs, AIChilles finds 49 distinct hidden weaknesses. We also show that explicitly including AIChilles in the AI-driven development lifecycle can mitigate several of these weaknesses.
An Integrated System for Real-Time Student Assessment and Career Guidance Using Neural Networks in Computing Disciplines
Many undergraduate students in Computer Science (CS) and Software Engineering (SWE) struggle to identify suitable career paths, particularly when their academic performance, abilities, and interests do not fully align. To address this issue, this study proposes an AI-driven Student Assessment and Career Prediction System that integrates a Career Guidance Expert (CGE) system with a Web-Based Student Assessment (WBSA) platform. Within the integrated framework, CGE enhances personalized career recommendations using AI while also assisting students after graduation in identifying suitable jobs, research domains, and higher study opportunities aligned with their skills and interests. The WBSA platform further strengthens interaction between students and faculty through assessments, personalized tasks, mentorship activities, and a secure real-time chat application. The CGE system employs a Multilayer Perceptron (MLP) model trained on real-world academic and extracurricular data collected using the snowball sampling method from the students of universities, achieving a validation accuracy of 94.71% in predicting personalized career paths. A pre-survey was conducted across universities to evaluate the proposed model before deployment. The WBSA system was developed as a modern web application using technologies such as Node.js, Next.js, and PostgreSQL to ensure scalability, responsiveness, and secure data management. The overall system is supported by a secure cloud-based infrastructure, the platform provides reliable performance while assisting graduates to select suitable career path in IT sector. In addition, a post-survey involving both students and faculty was conducted to gather feedback and further improve the overall effectiveness and usability of the system.
comment: 25 pages, 24 figures
OmniTraffic: A Controllable Generation Pipeline and Benchmark for Spatio-Temporal Traffic Reasoning
Traffic scene understanding requires models to reason beyond object recognition, including lane topology, multi-view geometry, temporal evolution, and signal-phase semantics. However, existing traffic-oriented multimodal benchmarks largely emphasize passive visual recognition or isolated video understanding, offering limited support for evaluating structure-aware traffic reasoning under controlled conditions. We introduce OmniTraffic, a controllable generation pipeline and benchmark for spatio-temporal traffic reasoning. Built around 12 real-world intersections reconstructed into editable 3D traffic environments and complemented by surveillance footage from two countries, OmniTraffic supports both controlled and natural-condition evaluation. It defines a three-level task hierarchy spanning scene perception, multi-view and temporal reasoning, and decision support. Using structured traffic metadata, OmniTraffic generates synchronized multi-view VQA samples covering vehicle states, lane functions, view--BEV correspondence, temporal dynamics, and signal-phase analysis, resulting in 8M VQA samples and a 3K human-verified test set. Evaluation of eleven frontier MLLMs reveals a large human--model gap, with the most pronounced failures in topology-grounded and spatio-temporal reasoning tasks. Fine-tuning a lightweight MLLM on simulated OmniTraffic data further improves performance on real-world traffic scenes, demonstrating the value of simulation-generated supervision for traffic-specific multimodal reasoning. Beyond a fixed dataset, OmniTraffic provides an extensible pipeline with configurable intersections, camera views, traffic demands, signal phases, visual conditions, and rare events.
comment: 34 pages, 28 figures
Age and Stability Trade-offs in Remote Monitoring Systems
Timely information is important in a wide variety of Internet of Things (IoT) services in which a shared server must manage two competing tasks: (i) processing a queue of jobs, and (ii) generating status updates to a remote monitor. This creates a fundamental trade-off between queue stability and data freshness. In this work, we model this scheduling decision as a Markov Decision Process (MDP) with the objective of minimizing a weighted sum of the average Age of Information (AoI) and the average queue length. We show that the optimal scheduling strategy is a queue-dependent age threshold which is monotonic. The shape of the switching curve differs according to different priority regimes. Finally, we compare the optimal MDP policy against heuristic policies.
Nodal Frequency Stability-Constrained UC & ED for Renewable-Dominated Power Systems
In modern power systems with high shares of renewable, inverter-based resources (IBRs), frequency stability becomes more complex due to the fast dynamics of IBRs and frequency trajectories that vary significantly from bus to bus. In this paper, we present an optimization framework for unit commitment and economic dispatch with endogenous frequency stability constraints at each bus. Two approaches for mitigating excessively low instantaneous frequency values in the event of the largest generator contingency are proposed: 1) by introducing a constraint requiring more thermal generation, and 2) by constraining the maximum power output of the generator that had the largest power output in the incumbent solution. Both approaches proved effective in eliminating dispatch scenarios that resulted in instantaneous frequencies below 58 Hz, while the second approach minimized the difference in production cost values from the non-stability-constrained case. Overall, the results indicate that the proposed optimization framework is a more effective alternative to frequency stability-constrained unit commitment and economic dispatch (UC & ED) than those based on the center-of-inertia (COI) principle.
Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC
We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.
Quaternionic Pole Placement via Companion Forms and the Ackermann Formula
We present an extension of state-feedback pole placement for quaternionic systems, based on companion forms and the Ackermann formula. For controllable single-input quaternionic LTI models, we define a companion polynomial that annihilates its companion matrix, characterize spectra via right-eigenvalue similarity classes, and prove coefficient-matching design in controllable coordinates. We then derive a coordinate-free Ackermann gain expression valid for real target polynomials, and state its scope and limitations. Short examples demonstrate correctness, practical use, and numerical simplicity.
comment: 8 pages. Accepted manuscript / author final version. The IEEE Early Access version is available in IEEE Xplore with DOI 10.1109/TAC.2026.3702917; the article is accepted for publication in a future issue of IEEE Transactions on Automatic Control and may still be copy-edited. Co-funded by the European Union under the project ROBOPROX (reg. no. CZ.02.01.01/00/22_008/0004590)
Using Dynamic Safety Margins as Control Barrier Functions
This paper presents an approach to design control barrier functions (CBFs) for arbitrary state and input constraints using tools from the reference governor literature. In particular, it is shown that dynamic safety margins (DSMs) are CBFs for an augmented system obtained by concatenating the state with a virtual reference. The proposed approach is agnostic to the relative degree and can handle multiple state and input constraints using the control-sharing property of CBFs. The construction of CBFs using Lyapunov-based DSMs is then investigated in further detail. Numerical simulations show that the method outperforms existing DSM-based approaches, while also guaranteeing safety and persistent feasibility of the associated optimization program.
comment: 12 pages, 5 figures, 2 tables
Passive Lifted FIR Filters for Nonlinear System Identification
Passivity is a fundamental property of physical systems. In data-driven modeling, ensuring that a learned model preserves this structural property is critical to avoiding instability in close loop. Although linear passive system identification is well-established, nonlinear extensions remain challenging. We propose nonlinear operators defined through passivity-preserving lifting of linear passive FIR filters. Passivity is enforced efficiently through frequency-domain constraints, and the nonlinear lifting includes output feedback for expressivity. Numerical and real-world experiments demonstrate the framework capabilities, including the computational advantage of frequency-domain constraints against LMI-based alternatives.
comment: 8 pages, 4 figures
Descriptive versus Regulatory Uncertainty in Bounded Predictive Systems
Any system that models the world under finite representational capacity must compress; any compression entails a prior; and the prior is the system's bias. What has not been established is whether uncertainty participates in the dynamics governing future behavior, or merely describes the output distribution without consequence. We introduce a structural distinction between descriptive uncertainty, which does not recursively modulate the system's policy, and regulatory uncertainty, which directly enters the optimization landscape and drives persistent adaptive restructuring. We prove formally that current transformer architectures are confined to descriptive uncertainty at inference. We ground this in thermodynamics via Landauer's principle: for uncertainty to be regulatory, epistemic error must cost real energy; in a decoupled system, hallucinations and correct derivations dissipate identical energy. We test this empirically across three locally-deployed language models (3B, 8B, 70B parameters). Token-level Shannon entropy is statistically invariant across tasks spanning pattern retrieval, causal operator application, and out-of-distribution causal generalization in all three models (all pairwise p >= 0.568; within-model ranges 0.011-0.028 nats), while task accuracy varies substantially across the same conditions (0%-100%). Entropy and accuracy are orthogonal. The decoupling is scale-invariant: larger models achieve higher accuracy but identical entropy flatness. This structural incapacity is not resolvable by additional parameters or training data. Genuine epistemic grounding requires physical coupling between thermodynamic substrate state and information processing cost.
Robotics
Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities
Robotic systems perceive the world through multiple input modalities -- including visual camera streams and natural language instructions -- and must select appropriate actions based on these signals. However, assuming the permanent availability of all input devices is unrealistic, as sensors may fail, become occluded, or drop out entirely during deployment. Robust handling of such missing-modality scenarios is therefore essential for real-world robot operation. This paper introduces RL4IL, a reinforcement learning guided method for imitation learning that selects the most suitable action for a given observation by identifying the most relevant expert demonstrations from a training library. A reinforcement learning policy, trained via Proximal Policy Optimisation over Breadth-First Search candidate sets, ranks candidate demonstrations and a soft cross-attention fusion head aggregates their action signals to produce the final prediction. When a modality is missing at inference time, a dedicated per-modality RL retrieval policy identifies donor demonstrations from the training library, and a soft imputation head reconstructs the missing embedding via cross-attention over the top-ranked donors -- without requiring any retraining of the system. Experiments on three LIBERO benchmark suites demonstrate that RL4IL substantially outperforms state-of-the-art imitation learning methods under sensor dropout conditions, while requiring no policy network training. The code can be found at https://github.com/h-ismkhan/Reinforcement-Learning-via-kNN-for-Robotic-Learning-with-Missing-Camera
Understanding and Modeling Perceived Cognitive and Physical Strain Dynamics for Planning-Oriented Human-Robot Collaboration in Prefabricated Construction
Human-robot collaboration (HRC) in prefabricated construction requires planning approaches that consider not only productivity but also time-dependent worker states during repeated work and rest. Existing planning models often rely on simplified assumptions about fatigue, workload, or recovery, with limited domain-specific empirical evidence on how perceived strain evolves. This study develops an empirically grounded, planning-oriented approach to characterize perceived strain accumulation and recovery in prefabricated construction HRC. A controlled repeated work-rest experiment assessed perceived cognitive and physical strain using the Rating Scale for Mental Effort and Borg's Rating of Perceived Exertion. Linear and exponential functional forms were evaluated, followed by mixed-effects modeling to examine collaborative conditions, session effects, and inter-individual variability. Results indicate that cognitive strain accumulation is best represented by a linear mixed-effects model, whereas rest-phase recovery follows nonlinear decay. The resulting planning-oriented models may inform future human-state-aware task allocation and scheduling research.
comment: 53 pages, 15 figures
FD-SLAM: Fast Dense Radar-Inertial SLAM with Frequency-Domain Loop Closure and Pose Graph Optimization
Radar SLAM is attractive for autonomous ground vehicles operating in visually degraded environments, however, scanning radars are noisy, have low scanning rates, and their measurements are challenging to match reliably over long trajectories. This paper presents FD-SLAM, a fast dense radar-inertial SLAM system that extends dense radar-inertial odometry with frequency-domain loop closure and pose graph optimization. The proposed method preserves an image-like structure of scanning radar measurements by using a compact frequency-domain polar descriptor for loop-candidate retrieval and a multi-stage verification pipeline based on temporal filtering, phase-correlation screening, scan-alignment similarity, and geometric consistency checks. Verified loop closures are added as non-sequential constraints in an SE(2) pose graph together with radar-inertial odometry factors. FD-SLAM is evaluated on a publicly available dataset using standard KITTI evaluation metrics. The results show that FD-SLAM improves FD-RIO baseline, achieves competitive performance against current state-of-the-art radar SLAM methods, and provides favorable rotational accuracy across multiple evaluated driving trajectories. Runtime analysis further indicates that the radar-inertial front-end operates above the radar sampling rate on a CPU-only setup, while loop closure detection and graph optimization remain suitable for parallel background execution.
FARM: Find Anything using Relational Spatial Memory
Robots operating in homes, warehouses, and other object-rich environments need memory systems that can find specific object instances on demand. Object-level memory alone is often insufficient: scenes contain many plausibly matching objects, and users refer to the target through relations to landmarks and surrounding objects (e.g. ``the tall lamp below the dartboard and to the left of the poster''), demanding a relational spatial memory that supports retrieval through semantic, appearance, and spatial predicates over objects. To achieve this, we present FARM (Find Anything using Relational Spatial Memory), which builds, in real time at 5-10 Hz, a compact, open-vocabulary, object-level memory with geometry, visual-language descriptors, and viewpoint evidence. At query time, FARM uses VLMs to parse the query and score visual evidence, while grounding spatial constraints explicitly through object symbols and relational predicates. This structured use of VLMs enables more accurate and robust retrieval than end-to-end reasoning over frame histories or scene-graph context. In experiments on 44k language queries spanning 67 indoor and outdoor scenes, ranging from 15 to 15,000 m^2, FARM improves Recall@5 and Recall@10 over prior methods by 164% and 224%, and a final VLM reranking stage improves Accuracy@1 by 35%, while running in real time. We further demonstrate closed-loop deployment on a quadrupedal robot using onboard sensors and compute.
Learning Context-Aware Neural ODE Dynamics for Adaptive Robotic Control
Robotic systems deployed in uncertain and dynamically changing environments often face variations in contact conditions, aerodynamic effects, and external disturbances that challenge reliable control. To remain effective under model-based control, these systems require dynamics models that can adapt to such changes, especially when direct access to complete environmental information is limited. To enable adaptability and facilitate integration with model predictive control, we propose a context-aware dynamics model based on neural ordinary differential equations, which infers environmental factors from state-action histories using a two-phase training procedure. We validate the approach across diverse robotic platforms, including a quadrotor in simulation, as well as a Sphero BOLT robot and a Fanuc manipulator in real-world experiments. The results demonstrate that our method effectively adapts to temporally and spatially varying environmental changes across different tasks. Videos are available at https://youtu.be/PY0sNyF2rqE , and the source code is available at https://github.com/syyu410-yu/context-aware-neural-ode-control.git .
A Bilateral Teleoperation Framework for Dexterous Manipulation
Dexterous teleoperation requires precise arm-hand coordination, low-latency feedback, and robust interaction in real-world contact-rich environments. This paper presents a modular bilateral teleoperation framework that integrates operator-side input interfaces with a robot-side dexterous hand and compliant robotic arm in a unified control architecture. The system supports position-based hand retargeting, differential arm control, multi-scale haptic feedback, and shared control for stable manipulation. We validate the framework through a real-world dexterous manipulation task, highlighting coordinated arm-hand control and contact-aware interaction. Beyond feasibility, we identify key design insights related to cross-embodiment mismatch, haptic feedback granularity, and shared control. The proposed platform provides a practical teleoperation system and a foundation for collecting high-quality demonstrations for future learning-from-demonstration research.
comment: 4 pages, 7 figures, 1 appendix,
A Corridor-Scale CARLA-VISSIM Co-Simulation Framework for Multi-Intersection Urban Traffic
This paper presents an implemented CARLA-VISSIM co-simulation framework for an urban corridor comprising approximately fifteen connected intersections centered on Martin Luther King Jr. Boulevard in Chattanooga, Tennessee. The system integrates CARLA 0.10.0 Unreal Engine 5 with PTV VISSIM 2026 through a bidirectional, step-synchronized interface that couples VISSIM's microscopic vehicle, pedestrian, and signal-controller logic with CARLA's high-fidelity 3D rendering. A LiDAR-derived elevation model and RoadRunner-based High Definition (HD) map provide terrain-accurate road geometry deployed consistently across both simulators. The framework incorporates explicit actor ownership, mirrored lifecycle management, coordinate reconciliation, and a latest-state-per-actor update policy, enabling stable interaction between VISSIM-controlled traffic and a CARLA-controlled ego vehicle. A corridor-scale case study demonstrates consistent traffic-signal mirroring, synchronized vehicle-pedestrian interactions, and stable mixed-authority operation under peak loads of approximately 100 vehicles and 100 pedestrians. The deployment captures the interaction of the five signalized intersections along MLK Street and their connecting upstream and downstream intersections, revealing synchronization challenges unique to multi-intersection corridors. Results indicate that this MLK-centered corridor provides an effective testbed for verifying cross-simulator consistency and that the proposed architecture supports reliable, perception-ready co-simulation for corridor-level traffic studies.
A Hybrid Model-Based and Model-Free Framework for Active Multi-View Viewpoint Optimization in Sonar Target Recognition
This paper presents a hybrid model-based and model-free framework for active multi-view target recognition using forward-looking sonar. A convolutional neural network (CNN) provides data-driven observation likelihoods, while Radon-based orientation estimation enables viewpoint-aware sensing without requiring angle annotations. During training, an information-gain-based reward guides a Proximal Policy Optimization (PPO) agent to learn a belief-aware viewpoint selection policy offline. At deployment, the learned policy performs real-time viewpoint selection using only CNN-based belief updates, eliminating the need for computationally expensive online POMDP tree search. Experiments on a marine-debris forward-looking sonar dataset demonstrate that the proposed approach achieves competitive recognition accuracy while reducing sensing steps and motion cost compared to model-based baselines.
Robust Conformal CBF and CLF Controllers via Iterative Policy Updates
Conformal prediction (CP) has been used to obtain probabilistic bounds on the error between a learned dynamics model and the true but unknown system. Such CP bounds can then be embedded into robust control Lyapunov function (CLF) and control barrier function (CBF) frameworks. However, such an approach does not retain stability/safety guarantees because of the distribution shift between the closed-loop trajectory distribution under the deployed CLF/CBF policy and the trajectory distribution from which the CP bound and its guarantees were derived. To address this issue, we propose an episodic framework that iteratively updates the robust conformal CLF/CBF policy while maintaining stability/safety guarantees across episodes. We achieve this by (1) using adversarially robust conformal prediction, and (2) quantifying a distribution shift budget that allows us to control how much the model error can increase across policy updates. This distribution shift budget is derived via a closed-loop trajectory sensitivity analysis, yielding an implicit and an explicit update rule for the CP bound. We analyze convergence of our algorithm, which we demonstrate on three case studies. To the best of our knowledge, these are the first results that provide stability/safety guarantees for robust conformal CBF/CLF policies.
SimWeaver: Zero-Shot RGB Sim-to-Real for Deformable Manipulation
RGB sim-to-real for deformable manipulation has remained largely unsolved without real-world fine-tuning. We present SimWeaver, which trains zero-shot RGB VLA policies on 200 simulated demonstrations per task, reaching above 80% per-task and 91% average real-world success across 5 diverse deformable tasks including plastic-bag manipulation, without teleoperation or per-task calibration. SimWeaver combines a reliable measurement-backed simulator (SimWeaver-Sim) with an extensible asset framework supporting single-image generation(SimWeaver-Asset), a deterministic topology-aware trajectory synthesizer (SimWeaver-Syn), and a sim-to-real protocol with ISP-aware photometric augmentation (SimWeaver-Real). On silk grasping, the sim-trained policy reaches 100% under visual distribution shifts where real-data baselines drop to 9-70%, at two orders of magnitude lower per-trajectory cost. We will release SimWeaver and a representative asset subset. Project page: https://simweaver.github.io/
Covariance-Regulated Recursive Koopman Learning for Nonlinear Systems with Uncertain Time-Varying Dynamics
Offline models for autonomous robots often fail under time-varying dynamics outside their training distribution. Koopman operator theory offers a linear representation of nonlinear dynamics via lifting, but its transition to real-time recursive estimation may suffer numerical vulnerabilities: covariance windup under low excitation when using exponential forgetting, and vanishing gain without forgetting. This paper introduces a Covariance-Regulated Recursive Koopman Learning (CR-RKL) framework with two complementary strategies--error dead-zone gating and constant-trace normalization--each independently capable of preventing covariance explosion and parameter freezing, with the latter additionally preserving the geometric structure of uncertainty. Validated on a non-holonomic differential-drive robot with wheel slip and Stribeck friction and on a 26-gram butterfly-inspired flapping-wing micro aerial vehicle, CR-RKL achieves numerically stable and accurate online modeling, and when embedded in model predictive control, it maintains reliable tracking performance under uncertain, time-varying dynamics.
Hamilton-Jacobi Reachability-Based Safe Reinforcement Learning for Emergency Collision Avoidance
Emergency collision avoidance under extreme driving conditions demands safety-critical control that accounts for both obstacle proximity and vehicle dynamic stability over a future time horizon, yet existing methods often rely on instantaneous or local safety evaluations. This paper proposes a safe reinforcement learning framework guided by a Hamilton-Jacobi (HJ) reachability based motion safety set that provides forward-looking safety supervision for constrained policy optimization. Specifically, a unified signed safety function is formulated by combining geometric collision margins and chassis stability limits, and is then extended through reachability analysis into a finite-horizon motion safety set that characterizes whether safety can be maintained under future vehicle state evolution. To enable practical computation, the motion safety set is approximated from offline extreme driving data, mitigating the computational burden of grid-based HJ solvers. The learned motion safety set is then embedded as a continuous safety cost into a constrained Markov decision process, and a PID-Lagrangian policy optimization scheme is employed to adaptively regulate the Lagrange multiplier for safety constraint enforcement. Simulation and real-vehicle experiments on low-adhesion obstacle-avoidance scenarios demonstrate that the proposed method achieves higher goal-reaching rates, produces smoother avoidance maneuvers, and maintains larger unified safety margins than baseline methods.
comment: Preprint
Acting While Understanding: Asynchronous Semantic-Action Decoupling for Real-Time Vision-Language-Action Models
Vision-Language-Action models (VLAs) have demonstrated strong task understanding and generalization in robotic manipulation, yet the high computational cost of full-model inference limits their deployment in low-latency, high-frequency closed-loop control. We propose an asynchronous semantic-action decoupling framework that separates semantic understanding from action generation along the internal semantic-action interface of existing VLAs, without redesigning the vision-language backbone or introducing an external planner. A low-frequency understanding module asynchronously updates reusable semantic conditions, while a high-frequency action module continuously outputs control actions without repeatedly invoking the full model. To mitigate the temporal mismatch between stale semantics and the current execution state, we further introduce historical action conditioning and time-misalignment training, which provide short-horizon execution context and improve feedback control robustness under stale semantic conditions. Experiments on LIBERO with $π_{0.5}$ and UniVLA, together with real-robot deployment using UniVLA, show that the proposed framework achieves up to 35.6 Hz server-side action-module inference throughput and offers a low-intrusion path to high-frequency closed-loop control without running full VLA inference at control rate.
OSDAG: Online Scheduling for Efficient Multi-Robot Collaboration
Coordinating heterogeneous multi-robot systems (MRS) for complex, long-horizon tasks requires both flexible high-level reasoning and efficient low-level scheduling. Existing LLM-based approaches address the reasoning side but introduce two critical bottlenecks: (1) repeated LLM inference during execution, which inflates latency with agent count, and (2) offline, pre-committed scheduling, which forces robots to idle while waiting for sequentially ordered predecessors even when independent work is available. This paper presents OSDAG, a novel framework that integrates LLM-based task reasoning with Directed Acyclic Graph (DAG) representation and constraint-aware online scheduling. The LLM is invoked once to decompose a natural-language instruction into a dependency-annotated task graph, and a lightweight online scheduler then allocates ready tasks to idle agents in real time. The DAG representation encodes both precedence and resource constraints, ensuring correctness while exposing all available parallelism. Experiments across five benchmark scenarios demonstrate that OSDAG achieves 5-15x faster reasoning time compared to dialogue-based methods, reduces makespan by up to 38% over sequential baselines, and maintains competitive success rates. Both simulation and real-world experiments on dual-arm manipulation tasks validate the effectiveness and practicality of the proposed approach for efficient multi-robot coordination. The website and resources are available at http://thanhnguyencanh.github.io/LLM_DAG4MultiRobot
Driving, Fast or Slow? Neuro-Symbolic Guidance for Motion Prediction in Multi-Modal Ground Mobility
Accurate and interpretable motion prediction for heterogeneous traffic spaces, including pedestrians, bicycles, cars, and trucks, is essential for safe autonomous navigation. Nevertheless, state-of-the-art approaches remain predominantly black-box, lacking explicit encoding of the regulatory and behavioral constraints of real-world mobility. We propose Trajectory Compliance-Shaping (TraCS), a neuro-symbolic framework that augments existing black-box motion prediction backbones with interpretable and probabilistic first-order logic. To do so, TraCS employs an agentic code-generation pipeline to bridge the gap between natural-language descriptions of traffic regulations and probabilistic motion prediction. Furthermore, TraCS employs a reactive data-streaming inference engine that maintains and efficiently updates compliance landscapes as scenes evolve. To prevent TraCS from overconfidently steering the backbone's predictions in the wrong direction, we propose a neural confidence rating learned as a context-aware attenuation of the compliance signal. We demonstrate on the Argoverse 2 benchmark how TraCS consistently improves state-of-the-art prediction backbones, showing that probabilistic and symbolic compliance reasoning is a broadly applicable and computationally efficient complement to purely neural motion predictors.
Co-Creating Buildable and Open Social Robot Study Companions with University Students
Open-source social robots offer accessibility, repairability, and student empowerment, yet the build itself often presents a barrier. Existing platforms either ship pre-assembled, foreclosing hands-on learning, or expose students to unfamiliar fasteners, opaque wiring, and inaccessible service points that erode engagement. Whether targeted mechanical redesign can lower this barrier whilst maintaining structural integrity remains untested. Here we show that Design for Assembly (DfA) and Design for Disassembly (DfD) interventions reshape how a build feels before they shorten how long it takes. Working with university students in Guyana and Estonia, we applied the Double Diamond framework to co-create the Robot Study Companion (RSC) v4.1: mapping pain points, then redesigning its chassis around twist-lock fasteners, snap-fit joints, and tool-free service latches. Across two studies with developers and first-time builders, system usability climbed from Poor to Excellent (SUS 59.4 to 89.4), perceived workload trended downward (NASA-TLX 4.29 to 4.00), and mean assembly time trended downward (21.4 to 13.7 minutes, with juniors' learning effect), whilst orientation cues and navigation continuity for first-time builders emerged as the next documentation frontier. Perceived workload, not completion time, appears to govern whether students take up open hardware.
comment: Accepted for 18th International Conference on Social Robotics (ICSR + ART 2026), London, UK | 1-4 July 2026
Rethinking Implicit Spatial Representation in Visuomotor Policy Learning
Generative model-based imitation learning has become a widely adopted paradigm for robotic manipulation, where policy performance depends critically on the conditioned visual representations. Although spatial softmax-based representations have been adopted in prior visuomotor policies, their effectiveness and underlying mechanisms remain insufficiently understood. This work rethinks the use of spatial softmax pooling: do such implicit spatial representations provide effective and stable visual features for robotic manipulation? Through systematic studies of different pooling methods in visual encoders, we find that this pooling operation produces compact and stable spatial representations, which outperform feature-value representations, despite using substantially fewer dimensions. Complementary saliency analysis further suggests that these spatial representations guide the encoder to focus more consistently on task-relevant regions. However, this advantage is limited by a representation bottleneck in current visual encoders: repeated downsampling operations weaken fine-grained spatial information before the action-generation module can use it, especially under low-resolution observations. Motivated by these findings, we propose PRISM, a visual encoder that preserves multiscale implicit spatial information through top-down cross-attention fusion. Experiments across multiple tasks and policy backbones show consistent improvements. In particular, on the low-resolution, high-precision ToolHang task, PRISM shows clear gains, improving the average success rate from 5.0% to 13.4% while increasing parameters by only 15.4%. These results support the use of multiscale implicit spatial representations as an effective and efficient design principle for robotic manipulation.
Seam-to-Graph Reconstruction for Garment Configuration Alignment
Seams encode rich structural information about garments but are frequently partially observable in robotic manipulation scenarios. To robustly leverage seam information, we propose a Seam-to-Graph network based on graph neural networks and attention mechanisms. This network maps unstructured seam observations to a topology-encoded structural skeleton graph for real-time garment state estimation. Using this skeleton-graph-based state estimation, we design a deformation-aware, hierarchical visual servoing controller for garment configuration alignment. We implement this controller on a bimanual robot system to load a garment onto a screen printing platen and to align it to the desired configuration precisely. Real-robot experiments demonstrate that the robot using the proposed method not only achieves human-level alignment accuracy with reduced variance in alignment error but is also robust to different garments. These results demonstrate that the use of seam information is effective for garment manipulation.
comment: 11 pages, 9 figures
VLALeaks: Membership Inference Attacks against Vision-Language-Action Models
Vision-Language-Action (VLA) models enable end-to-end robot control and have garnered widespread attention. However, the memorization of training data inherent to VLA, coupled with the high cost of robotic data acquisition, raises serious concerns regarding data privacy leakage and intellectual property infringement. Membership inference attacks (MIAs) aim to determine whether a given sample belongs to the training set. While representing a significant privacy threat, this attack remains underexplored in the context of VLA models. To bridge this gap, we propose VLALeaks, which is based on attention discrepancies in VLA models. We reveal, for the first time, the privacy vulnerabilities of VLA models. Specifically, it comprises a two-stage process: (1) membership feature extraction, and (2) attack model construction. Experimental results across multiple VLA benchmarks demonstrate that VLALeaks readily reveals membership information and achieves optimal attack AUC and TPR@1\%FPR, highlighting the privacy vulnerabilities in current VLA model deployments. Our work is the first systematic study of MIAs on VLA models, aiming to provide insights for secure and trustworthy VLA models.
comment: Security and Privacy
Task-Aware Environment Augmentation for Reliable Navigation via Shielded Conditional Diffusion
Reliable trajectory planning under partial observability depends not only on computing a feasible geometric path, but also on whether the robot receives informative observations while executing that trajectory. Existing approaches usually keep the environment fixed and adapt the robot through belief-space planning, active localization, or added sensing, often incurring costly uncertainty propagation and brittle behavior in observation-poor regions. We flip this perspective and address the largely open problem of \emph{task-aware environment augmentation}: given a mapped environment, a planned task trajectory, and a small budget of visual fiducial markers, where should the environment be augmented so that the planned trajectory can be executed reliably under uncertainty? Our key observation is that useful marker layouts are defined by the localization support they provide along the task trajectory: a small number of well-timed observations can be sufficient to prevent uncertainty from accumulating in regions where state-estimation error would otherwise compromise control. Building on this observation, we present \tbp{SCoDA}, $\textbf{S}$hielded $\textbf{Co}$nditional $\textbf{D}$iffusion for Environment $\textbf{A}$ugmentation. \tbp{SCoDA} learns a conditional distribution over high-performing fiducial layouts from data, using the environment, planned trajectory, disturbance context, and desired execution profile as conditioning. Its shielded sampler reasons over where along the planned execution pose corrections should occur, and steers this distribution toward task-relevant, finite-budget augmentations. Across simulated benchmarks and hardware deployments, we show that \tbp{SCoDA} improves trajectory execution reliability and completion time over strong baselines. Code, models and dataset available at: \hyperlink{scoda-diffusion.github.io}{https://scoda-diffusion.github.io/}
MimicIK: Real-Time Generative Inverse Kinematics from Teleoperation with FK Consistency
Inverse kinematics (IK) remains a critical bottleneck for real-time robot manipulation. Classical numerical solvers achieve high geometric precision but often suffer from discontinuous branch switching and unstable behavior near kinematic singularities during closed-loop deployment. Meanwhile, learned IK approaches frequently struggle to balance spatial accuracy, motion smoothness, and real-time efficiency, particularly when trained on noisy human teleoperation data. We present \textbf{MimicIK}, a real-time generative inverse kinematics framework that learns smooth and robust joint-space motion priors from teleoperation demonstrations through conditional flow matching. Given the current joint configuration and a target end-effector pose, MimicIK predicts continuous delta-joint commands using an efficient two-step iterative refinement process based on a Minimal Iterative Policy (MIP) backbone. To enforce physical consistency, we further introduce an FK consistency loss, a differentiable forward-kinematics regularization that penalizes task-space deviations from the target pose during training. We evaluate MimicIK on a real-world 6-DOF robot dataset containing 8,848 teleoperation demonstrations. MimicIK achieves a mean position error of 4.65 mm, a 10 mm success rate of 92.01\%, and a trajectory spike rate of only 7.99\%. Compared with a UNet diffusion baseline, our method improves both spatial accuracy and motion smoothness while reducing inference latency from 21.66 ms to 6.74 ms. Furthermore, unlike deterministic MLP baselines that catastrophically diverge under out-of-distribution deployment, MimicIK remains stable near singular configurations and enables robust 20 Hz real-time control on deployment hardware.
MotionVLA: Vision-Language-Action Model for Humanoid Motion
Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
Self-Driving Negotiator: An interactive, verifiable benchmark for social negotiation and theory of mind under hidden intent
Autonomous driving is full of tiny social negotiations: a driver presses forward, another yields, a pedestrian fakes toward the curb, or a lane vehicle chooses whether to open a merge gap. Such interactions require inferring hidden intent from behavior under partial observability and then acting safely and efficiently. Existing autonomous-driving language benchmarks mostly focus on perception, visual question answering, or open-loop planning, while existing language-agent negotiation benchmarks typically make the negotiation explicit in text. Self-Driving Negotiator bridges the gap between the two: a text-only, multi-turn, procedurally generated environment for measuring implicit social coordination in driving. Agents generate specific driving actions. Reward and diagnostics are computed from the privileged simulator state, not from the explanation of the model. This report covers task design, reward and anti-gaming invariants, validated scenarios, non-LLM baselines, and a six-model inference leaderboard. Current models are far removed from the scripted expert. The best average success rate across three scenarios is 0.68; contested merge is statistically flat across models; and difficulty tiers separate cue-following from true wait-for-commitment behavior.
DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects
Dexterous interaction with articulated objects is important for household, assistive, and humanoid manipulation, where multi-finger hands can provide compliant contact patterns beyond parallel-jaw grasping. However, articulated-object manipulation differs from static-object manipulation: the target part cannot be directly actuated, and its motion must emerge through sustained physical hand--handle contact. This makes the transition from object-centric articulated generation to hand-driven dexterous hand--object interaction non-trivial, since geometric trajectory replay or open-loop execution does not model the contact dynamics required to move the articulated part. Moreover, policies trained only for task completion under fixed dynamics can overfit nominal contact loads, especially without tactile or force feedback, and may degrade when the contact load changes. To address these challenges, we present DragMesh-2, a contact-driven framework for dexterous interaction with articulated objects that extends articulated interaction from object-centric generation to hand-driven dexterous hand--object interaction, where articulated motion must arise through physical contact. We further propose PICA, a physically informed contact-aware training mechanism that injects physical signals into policy learning without tactile or force feedback, improving robustness and task success under changing contact loads. Finally, we conduct systematic evaluation across multiple damping conditions and articulated-object categories to study robustness under contact-load variation, and provide a pure-geometry dexterous interaction resource to support future loco-manipulation and humanoid hand--object interaction research. Across seven GAPartNet objects, DragMesh-2 achieves stronger robustness under contact-load variation than the compared methods while maintaining high task success across damping conditions.
comment: Code: https://github.com/AIGeeksGroup/DragMesh-2. Website: https://aigeeksgroup.github.io/DragMesh-2
Think Less, Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models ICML 2026
Existing Vision-Language-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm suffers from high computational costs and error propagation in multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a novel Latent Reasoning VLA framework that models reasoning as a sequence of unobservable latent variables, bypassing the need for explicit text generation. However, latent trajectories are inherently susceptible to noise interference and misalignment with downstream objectives. To address this, we introduce a Reinforcement Learning-based Denoising mechanism that treats latent state generation as a sequential decision process, optimizing reasoning trajectories via task-level rewards. Furthermore, we incorporate an Early-Exit Strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between depth and efficiency. Extensive experiments on embodied decision benchmarks demonstrate that AVA-VLA achieves a 6x inference speedup over explicit CoT methods while attaining a 98.3% average success rate on LIBERO, improving both efficiency and long-horizon stability over full-reasoning baselines.
comment: Accepted at ICML 2026
Design and Fabrication of a Spin Coater with In-Situ Optical Measurement for Soft Thin Films
Spin coating is widely used for fabrication of thin polymer and elastomer films, yet reliable thickness verification of highly compliant materials remains challenging due to deformation from contact-based measurements and the cost and complexity of conventional optical metrology. Accurate thickness control is especially critical in soft elastomer applications such as dielectric elastomer actuators (DEAs), where mechanical and functional performance scales strongly with film thickness. This work presents a low-cost, primarily 3D-printed benchtop spin coater with an integrated, minimally deforming optical thickness measurement system for soft-film fabrication workflows. The system is designed to manufacture films between 50 and 300 microns thick with repeatability within 10 microns. Thickness is measured in-situ by tracking displacement of a reflected laser beam via quadrant photodetector, avoiding significant deformation. Optical geometry, sensor linearity constraints, and structural validation via finite element analysis are discussed. Experimental validation using calibrated metal shims demonstrated a thickness resolution of 3.6-3.7 microns and best-case measurement repeatability of 13 microns (95 percent confidence interval). The platform repeatably produced silicone films within 9 microns of target thickness, demonstrating that accessible optical metrology can be integrated into a low-cost spin coating system for practical, thickness-controlled fabrication of compliant thin films without specialized industrial instrumentation.
comment: 8 pages, 7 figures, 5 tables. To be published in the conference proceedings for AIM 2026
Latent Action Pretraining Through World Modeling
Vision-Language-Action (VLA) models have gained popularity for learning robotic manipulation tasks that follow language instructions. State-of-the-art VLAs, such as OpenVLA and $π_{0}$, were trained on large-scale, manually labeled action datasets collected through teleoperation. More recent approaches, including LAPA and villa-X, introduce latent action representations that enable unsupervised pretraining on unlabeled datasets by modeling abstract visual changes between frames. Although these methods have shown strong results, their large model sizes make deployment in real-world settings challenging. In this work, we propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way, by learning latent action representations from unlabeled video data through world modeling. These videos can be sourced from robot recordings or videos of humans performing actions with everyday objects. Our framework is able to transfer learned knowledge across tasks, environments, and embodiments. It outperforms models pretrained with ground-truth robot actions and other similar pretraining methods on the LIBERO benchmark and real-world setup, while being efficient and practical for real-world settings.
SimCoachCorpus: A naturalistic dataset with language and trajectories for embodied teaching KDD
High-quality curated datasets are essential for training and evaluating AI approaches, but are often lacking in embodied interactive domains where language and physical action are intertwined. In particular, few datasets capture how people acquire motor skills in embodied tasks through verbal instruction over time. To address this gap, we introduce SimCoachCorpus: a unique dataset of race car simulator driving that enables the investigation of rich phenomena during guided and unguided motor skill acquisition. In this dataset, 29 humans were asked to drive in a driving simulator around a race track for approximately ninety minutes. Fifteen participants received one-on-one instruction from a professional performance driving coach, and 14 participants drove without coaching instruction. SimCoachCorpus includes features such as vehicle state and inputs, map (track boundaries and race-line), and cone landmarks. Additionally, these are synchronized with the coach's concurrent verbal feedback and additional terminal feedback at the end of each lap. We also provide high-quality annotations of high-level coaching categories for each concurrent feedback utterance, ratings on students' compliance with coaching advice, and self-reported cognitive load and emotional state of participants (gathered from surveys during the study). The final dataset includes over 20,000 concurrent feedback utterances, over 400 terminal feedback utterances, and over 40 hours of interactive driving data. Our naturalistic interactive dataset can be used to investigate motor learning dynamics, explore linguistic phenomena, and train computational models of teaching and learning. We demonstrate applications of this dataset for in-context learning, imitation learning, and topic modeling. Data is hosted at https://doi.org/10.7910/DVN/W7VTKZ and code is available at https://github.com/ToyotaResearchInstitute/sim_coach_corpus
comment: This is an extended version of a paper accepted to KDD Datasets & Benchmarks Track 2026
CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture
Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in association accuracy and identification precision scores with a lower number of identity switches.
comment: 8 pages, 5 figures, and 4 tables
Intelligent Sailing Model for Open Sea Navigation
Autonomous vessels potentially enhance safety and reliability of seaborne trade. To facilitate the development of autonomous vessels, simulations are required to model realistic interactions with other vessels. However, modeling realistic interactive maritime traffic is challenging due to the unstructured environment, coarsely specified traffic rules, and largely varying vessel types. Currently, there is no standard for simulating interactive maritime environments in order to rigorously benchmark autonomous vessel algorithms. In this paper, we introduce the first intelligent sailing model (ISM), which simulates rule-compliant vessels for navigation on the open sea. An ISM vessel reacts to other traffic participants according to maritime traffic rules while at the same time solving a motion planning task characterized by waypoints. In particular, the ISM monitors the applicable rules, generates rule-compliant waypoints accordingly, and utilizes a model predictive control for tracking the waypoints. We evaluate the ISM in two environments: interactive traffic with only ISM vessels and mixed traffic where some vessel trajectories are from recorded real-world maritime traffic data or handcrafted for criticality. Our results show that simulations with many ISM vessels of different vessel types are rule-compliant and scalable. We tested 4,049 critical traffic scenarios. For interactive traffic with ISM vessels, no collisions occurred while goal-reaching rates of about 97 percent were achieved.
GuideWalk: Learning Unified Autonomous Navigation and Locomotion for Humanoid Robots across Versatile Terrains
Humanoid robots have achieved strong locomotion capabilities, but reliable navigation on versatile terrains remains challenging because obstacle avoidance must be coordinated with dynamically feasible motion. In this work, we present GuideWalk, a unified end-to-end framework that integrates traversability-aware navigation guidance with terrain-adaptive locomotion teacher for humanoid navigation. Specifically, we introduce a navigation module that provides explicit velocity guidance, decoupling obstacle avoidance from terrain conditions to enable robust planning across diverse environments. We propose a composite teacher distillation scheme, where goal-directed commands and dynamically consistent actions are aggregated and distilled into a single policy. To further improve robustness, the distilled policy is refined with reinforcement learning and an auxiliary behavior cloning objective, which promotes exploration while preserving desirable teacher behaviors. Experiments demonstrate that GuideWalk achieves stable and effective navigation while maintaining stable humanoid locomotion.
From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges ICML 2026
Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. ResVLA also demonstrates strong performance in real-world robot experiments.
comment: Accepted to ICML 2026
WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation
Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.
Trajectory-Level Redirection Attacks on Vision-Language-Action Models
Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/
MapDream: Task-Driven Map Learning for Vision-Language Navigation
Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.
Action with Visual Primitives
Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture that implements this visual-primitive-centric interface: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector kinematics. Real-robot experiments on general pick-and-place tasks show that AVP improves the success rate by 37.04% over pi_0.5 and outperforms other recent methods, with consistent gains in data efficiency, spatial-compositional generalization, and object-level transfer.
comment: 9 pages, 6 figures. Project page: https://kingdroper.github.io/AVP/
Multiagent Systems
CoAgent: Concurrency Control for Multi-Agent Systems ATC 2026
Multi-agent LLM systems -- coding agents, devops agents, document agents -- now routinely run several agents in parallel against the same git tree, Kubernetes cluster, or document. As soon as two of them mutate shared state, they enter the regime classical concurrency control has studied for decades, but classical mechanisms fit LLM agents poorly. A single agent transaction spans minutes of inference, read sets are broad and opaque rather than statically inferable, and the live state agents act on admits neither fork nor buffer, so writes take effect the moment they execute. Locks block long inference intervals; OCC abort-and-retry discards minutes of work on every conflict. This paper builds concurrency control on a capability classical transactions lack: the LLM inside each agent can judge whether a conflicting write invalidates its plan, and can repair exactly the operations that depended on it. Control therefore turns advisory: the runtime informs, the agent repairs. Our protocol, MTPO (Monotonic Trajectory Pre-Order), fixes a serialization order at launch, serves each read the order-filtered value, and applies writes speculatively in place; a one-way notification asks an affected reader to re-judge and patch its plan, while the framework mechanically undoes and reorders misplaced writes through the saga-style inverse each tool registers in advance. At quiescence the run is serializable in the pre-decided order. We realize MTPO as CoAgent, toolcall middleware whose privileged ToolSmith grows footprint-declared, undoable tools online. On ten contended workloads, CoAgent stays within 5\% of serial correctness at a $1.4\times$ speedup and near-serial token cost, where 2PL and OCC surrender nearly all concurrency gains; on a bash-only target system, it grows a 25-tool library online and lifts the task pass rate from 45/71 to 63/71 at $0.80\times$ the time and $0.86\times$ the cost.
comment: 14 pages, 7 figures. Submitted to ATC 2026
Probing Dec-POMDP Reasoning in Cooperative MARL AAMAS 2026
Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. Genuinely solving such tasks requires Dec-POMDP reasoning, where agents use history to infer hidden states and coordinate based on local information. Yet it remains unclear whether popular benchmarks actually demand this reasoning or permit success via simpler strategies. We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that success on these benchmarks rarely requires genuine Dec-POMDP reasoning. Reactive policies match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence. These findings suggest that some widely used benchmarks may not adequately test core Dec-POMDP assumptions under current training paradigms, potentially leading to over-optimistic assessments of progress. We release our diagnostic tooling to support more rigorous environment design and evaluation in cooperative MARL.
comment: To appear at the 25th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2026), added DOI
TechRAG: Evidence-Gated Multimodal Agentic RAG for Technical Literature Reasoning
This paper presents an agentic multimodal retrieval-augmented generation (RAG) framework for domain-specific literature reasoning, instantiated on a curated corpus of several thousand papers in intelligent tires, vehicle dynamics, vehicle control, sensing, estimation, and machine learning. Unlike conventional single-pass RAG systems, the proposed architecture uses an autonomous, evidence-gated pipeline that classifies query intent, generates separate text and visual query rewrites, performs hybrid text retrieval with FAISS and BM25 followed by cross-encoder reranking, expands evidence through graph-guided chunk traversal over a Neo4j knowledge graph, and retrieves visual document evidence using ColSmol late-interaction embeddings with MUVERA fixed-dimensional encoding, approximate nearest-neighbor search, and MaxSim reranking. The framework scores evidence sufficiency using a 100-point rubric with hybrid rule-based/LLM review, retries retrieval through drift-guarded reformulation, searches external academic databases through optimize--search--vet loops, merges and deduplicates multimodal evidence, verifies citation integrity, and generates cited answers through Planner, Researcher, Writer, and Critic agents with self-correcting revision. Key contributions include: (i) a scalable multimodal retrieval architecture combining text, graph, and visual evidence over 40,000 document pages; (ii) an interpretable evidence sufficiency and retry mechanism; (iii) a multi-agent generation pipeline with evidence mapping and critic-driven revision; (iv) a domain knowledge graph with LLM-based entity extraction, OpenAlex author validation, and intra-corpus citation resolution; and (v) a route-dependent external search architecture for targeted literature expansion. The result is a practical, evidence-gated, multimodal agentic RAG architecture for technical reasoning over specialized research corpora.
MedCollab: IBIS-Guided Multi-Agent Collaboration with Hierarchical Disease Relation Chains for Clinical Diagnosis
Clinical diagnosis is a gradual process of evidence integration, in which physicians move from symptoms and medical history to examinations, competing hypotheses, disease relations, and treatment decisions. Large language models have advanced medical text understanding and generation. Yet their clinical use remains limited by weak evidence grounding, opaque reasoning, and inconsistent links among differential diagnosis, final diagnosis, diagnostic basis, and treatment planning. We introduce MedCollab, a multi-agent framework for full-cycle clinical diagnosis and report generation. MedCollab coordinates specialist and examination agents according to patient records. It structures agent deliberation with an Issue-Based Information System (IBIS) protocol, so that each diagnostic position is supported by patient-specific evidence and medical knowledge. It also builds Hierarchical Disease Relation Chains (HDRC) to connect accepted hypotheses through progression, complication, and comorbidity relations. During multi-round deliberation, a verifier-guided consensus module evaluates evidence support, medical plausibility, and logical conflicts. It then adjusts agent contributions and filters unsupported reasoning. Experiments on ClinicalBench and MIMIC-IV show that MedCollab outperforms leading LLMs and medical multi-agent baselines in diagnostic accuracy, evidence consistency, and clinical reasoning quality. These results indicate that structured and auditable collaboration can produce more faithful and clinically coherent diagnostic reports.
SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration
Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.
From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability
Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another but assume address-based transport. Whether over HTTP(S) or a content-protecting binding such as MLS-based SLIM, these transports protect message content yet leave the communication graph exposed: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are capability-labeled, workflows are structured and chained, and interactions are coupled to real actions, so an observer recovers more than past relationships: it can infer the pending workflow and, at machine speed, act on that inference before the workflow completes. The threat is therefore one of workflow integrity, not privacy alone. We formalize a threat model for the communication graph and locate what makes its metadata distinctively consequential: not stronger fingerprinting, which we measure to be comparable to other machine traffic, but exposure across independent trust domains, coupled to autonomous action. We define transport- and bootstrap-layer privacy properties, evaluate candidate transports, and give an A2A case study where a metadata-protecting binding surfaces the protocol's implicit identity assumptions. On a generative model anchored to a real capture and over a live A2A binding, a label-blind classifier recovers a task's class from passive metadata well above chance, and from only its opening; a defense-aware adversary does not overturn this, and only the full set of properties drives recovery toward chance. The leverage of acting on the leak is distinct from recoverability: under a fixed budget an adversary realizes most of a clairvoyant attacker's advantage from a workflow's opening, governed by precision over the top-ranked workflows rather than overall accuracy, so a defense suppresses it even while recovery stays above chance.
comment: 18 pages, 6 figures, 6 tables
Interpretation as Linear Transformation: A Cognitive-Geometric Model of Concepts and Meaning
This paper develops a geometric framework for modeling concepts, motivation, and influence across cognitively heterogeneous agents. Each agent is represented by a personalized value space, a vector space encoding the internal dimensions through which the agent interprets and evaluates meaning. Evaluative concepts are formalized as structured vectors, abstract beings, whose transmission is mediated by linear interpretation maps. An abstract being survives communication only if it avoids the null spaces of these maps, yielding a structural criterion for intelligibility, miscommunication, and concept death. Within this framework, I show how conceptual distortion, motivational drift, and the limits of mutual understanding arise from purely algebraic constraints. A central result, the No-Null-Space Leadership Condition, characterizes leadership as a property of representational reachability rather than persuasion or authority. More broadly, the model explains how abstract beings can propagate, mutate, or disappear as they traverse diverse cognitive geometries. The account unifies insights from conceptual spaces, social epistemology, and AI value alignment by grounding meaning preservation in structural compatibility rather than shared information or rationality. I argue that this cognitive-geometric perspective clarifies the epistemic boundaries of influence in both human and artificial systems, and offers a general foundation for analyzing conceptual dynamics across heterogeneous agents.
comment: The revised draft w.r.t. reviewer comments. The code is at https://github.com/DarkEyes/Cognitive-Geometry
The Illusion of Multi-Agent Advantage
Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
HCP-MAD:Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate
Multi-Agent Debate (MAD) is a collaborative framework in which multiple agents iteratively refine solutions through the generation of reasoning and alternating critique cycles. Current work primarily optimizes intra-round topologies and inter-round interactions separately, limiting the adaptation of token costs to task complexity. This work introduces Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate (HCP-MAD), leveraging consensus as a dynamic signal to facilitate progressive reasoning. The core motivation is that a majority of straightforward tasks can be effectively resolved via lightweight pair-agent debates, while complex tasks require expanded collaboration. Firstly, Heterogeneous Consensus Verification conducts rapid consensus verification using a pair of heterogeneous agents for early stopping. Next, Heterogeneous Pair-Agent Debate applies an adaptive stopping criterion to terminate mutual critique of reasoning traces. Finally, the unresolved tasks are addressed through Escalated Collective Voting by aggregating diverse perspectives from additional agents. Experiments across six benchmarks show that HCP-MAD enhances accuracy while substantially reducing token costs. Code is https://github.com/fuyu66/HCP-MAD.
Systems and Control (EESS)
Positive-Real Identification of Sparse Mori-Hamiltonians from Partial Observations
Discovering the governing equations of a physical system from data is a central goal across the sciences, yet in most experiments only a few states are accessible while the rest stay hidden. Existing approaches treat this partial observability as an obstacle to be removed by first reconstructing the hidden state -- a step that is ill-posed under noise and that discards the physical constraints, such as energy conservation, that the true dynamics obey. We show that for conservative (Hamiltonian) systems no reconstruction is needed: projecting the dynamics onto the measured coordinates yields a memory kernel that we prove to be a lossless positive-real rational matrix, whose poles are the hidden natural frequencies and whose positive-semidefinite residues encode the couplings. The governing equation -- and the underlying Hamiltonian -- can therefore be read directly from the autocorrelation of the measured signal, with guarantees of uniqueness and physical passivity, and without neural networks. We validate the approach on linear, nonlinear, and chaotic systems under realistic noise. By recovering interpretable equations of motion that conserve energy by construction from partial measurements, the method offers a common tool for problems spanning mechanics, fluid and plasma physics, and beyond.
LLM4RTL: Tool-Assisted LLM for RTL Generation
Large language models (LLMs) have facilitated impressive progress in software engineering, code generation, tooling, and systems. Concurrently, a significant body of research has developed which explores a growing variety of methods and systems for applying LLMs to hardware and chip design (e.g., systems for RTL code generation based on functional description). However, when it comes to open Verilog/RTL code-generation, we need high-quality training samples to build specialized and more effective LLM systems through fine-tuning or low-rank adaptation. Here, we propose a ``judge-renew-check-renew-check'' (JRCRC) pipeline which updates a current public dataset using a hierarchy of state-of-the-art commercial LLM models differing in their costs and capabilities in RTL code generation. This approach achieves a cost-effective mechanism for filtering and refining code-generation samples into a higher-quality training dataset. Our experiments also identify some common weaknesses of LLMs in rule-based reasoning and logic, and consequently, in RTL code-generation. Having identified these weaknesses, we develop an architecture for incorporating pre-processing tools to dynamically assist the LLMs in inferring logical relationships from tabular data formats. With our tools-assisted architecture for RTL code generation, we achieve significant overall performance gains in the VerilogEval benchmark and outperform many state-of-the-art methods. Our LLM4RTL system achieves performance comparable to that of GPT-4O using a significantly much smaller LLM.
On Type Deception in Linear-Quadratic Differential Games
We consider two-player linear-quadratic differential games of incomplete information, in which one player has a private type initially unknown to the other. The typed player has incentive to conceal their type, while the uninformed player has the potential to infer it during play. Any ex-ante equilibrium in this setting will decompose into a deceptive, pooling phase, and a complete-information, revelatory phase. We demonstrate how to solve both phases via nested Riccati equations. Candidate equilibria are then found by maximizing the game value over a scalar revelation time, for which we provide a gradient in the case of time-homogeneous system matrices. We conclude by demonstrating our framework in a pursuit-evasion game with time-varying control advantages, finding interior optimal revelation times that confirm deception has quantifiable ex-ante value.
comment: accepted version (L-CSS 2026)
A Bilateral Teleoperation Framework for Dexterous Manipulation
Dexterous teleoperation requires precise arm-hand coordination, low-latency feedback, and robust interaction in real-world contact-rich environments. This paper presents a modular bilateral teleoperation framework that integrates operator-side input interfaces with a robot-side dexterous hand and compliant robotic arm in a unified control architecture. The system supports position-based hand retargeting, differential arm control, multi-scale haptic feedback, and shared control for stable manipulation. We validate the framework through a real-world dexterous manipulation task, highlighting coordinated arm-hand control and contact-aware interaction. Beyond feasibility, we identify key design insights related to cross-embodiment mismatch, haptic feedback granularity, and shared control. The proposed platform provides a practical teleoperation system and a foundation for collecting high-quality demonstrations for future learning-from-demonstration research.
comment: 4 pages, 7 figures, 1 appendix,
Minimum settling-time PI control of pure delay processes under a hard non-overshoot constraint: exact boundary-contact characterization and the role of the MID point
We solve, exactly, the problem of minimum settling-time PI control of a pure delay process K e^{-Ls} under the hard time-domain constraint of zero overshoot, y(t) <= 1 for all t. The closed loop is a neutral delay system whose step response is piecewise polynomial on the delay segments, with geometrically decaying jump discontinuities at the segment boundaries t = kL. The constrained optimum is characterized by an equioscillation-type contact structure whose active contacts sit at echo boundaries: kink maxima grazing the setpoint, jumps landing on the settling-band edge, and boundary troughs anchored to it. The number of contact equations equals the number of gains, so the optimum is exactly computable for every band delta. In a closed-form regime, delta in [(3-2 sqrt2)/4, (3-2 sqrt2)/2] approx [4.29%, 8.58%], the optimal gains are independent of delta: K Kp = 1 - sqrt2/2, K Ki L = sqrt2/2, and the optimal settling time is Ts*(delta) = (4 - sqrt2 - 2 sqrt(delta)) L. Outside this window the optimum solves an explicit two-equation polynomial system per regime, and Ts*(delta) is a staircase with exact flats at integer multiples of L from jump-landing pinning. As delta -> 0 the optimal gains converge to K Kp = e^{-2}, K Ki L = 4 e^{-2}, the generic multiplicity-induced-dominancy (GMID) point of the neutral quasipolynomial. The GMID response satisfies the hard constraint and uniquely maximizes the decay rate; yet at every finite delta the delta-adapted optimum strictly beats the fixed GMID tuning, by about 40% at delta = 2%. The MID point is thus the limit of the optimal gains without ever being the optimal tuning. A numerical extension to first-order-plus-time-delay plants quantifies the speed/robustness trade across Ms in [1.39, 1.76].
Data Center Life Cycle Co-Design Optimization
Liquid cooled supercomputers dissipate tens of megawatts of waste heat through cooling plants organized as parallel subloops that serve coolant distribution units. The number of subloops and the assignment of units to them are design decisions fixed at construction, yet they have not been systematically optimized for facilities at this scale. As electricity grids decarbonize, embodied carbon becomes a larger share of facility life cycle emissions and the cost of an unnecessary subloop becomes harder to justify. We present a framework that integrates operational energy from a validated control optimizer based on sequential least squares programming, embodied carbon from a bill of materials, and expected unplanned downtime from a per subloop reliability model. The framework is applied to the Frontier supercomputer, evaluating all 611 ways of partitioning its 25 coolant distribution units into two through six subloops. The life cycle cost and carbon optimum is found at two subloops holding 14 and 11 units, achieving 3,320.7 tonnes of carbon dioxide equivalent and $3.99 million over a seven year horizon, a saving of 50.2 tonnes and $100,000 compared to built four subloop configuration. The optimum remains on the Pareto front in all 15 scenarios of a one at a time sensitivity sweep. A semi-analytical decision rule generalizes the result, predicting four subloops for Aurora, two for El Capitan, and one for LUMI. When reliability is treated as a hard constraint set by operations policy, the four subloop Frontier deployment is consistent with the constrained optimum.
comment: 17 pages, 8 figures
Robust Conformal CBF and CLF Controllers via Iterative Policy Updates
Conformal prediction (CP) has been used to obtain probabilistic bounds on the error between a learned dynamics model and the true but unknown system. Such CP bounds can then be embedded into robust control Lyapunov function (CLF) and control barrier function (CBF) frameworks. However, such an approach does not retain stability/safety guarantees because of the distribution shift between the closed-loop trajectory distribution under the deployed CLF/CBF policy and the trajectory distribution from which the CP bound and its guarantees were derived. To address this issue, we propose an episodic framework that iteratively updates the robust conformal CLF/CBF policy while maintaining stability/safety guarantees across episodes. We achieve this by (1) using adversarially robust conformal prediction, and (2) quantifying a distribution shift budget that allows us to control how much the model error can increase across policy updates. This distribution shift budget is derived via a closed-loop trajectory sensitivity analysis, yielding an implicit and an explicit update rule for the CP bound. We analyze convergence of our algorithm, which we demonstrate on three case studies. To the best of our knowledge, these are the first results that provide stability/safety guarantees for robust conformal CBF/CLF policies.
Hamilton-Jacobi Reachability-Based Safe Reinforcement Learning for Emergency Collision Avoidance
Emergency collision avoidance under extreme driving conditions demands safety-critical control that accounts for both obstacle proximity and vehicle dynamic stability over a future time horizon, yet existing methods often rely on instantaneous or local safety evaluations. This paper proposes a safe reinforcement learning framework guided by a Hamilton-Jacobi (HJ) reachability based motion safety set that provides forward-looking safety supervision for constrained policy optimization. Specifically, a unified signed safety function is formulated by combining geometric collision margins and chassis stability limits, and is then extended through reachability analysis into a finite-horizon motion safety set that characterizes whether safety can be maintained under future vehicle state evolution. To enable practical computation, the motion safety set is approximated from offline extreme driving data, mitigating the computational burden of grid-based HJ solvers. The learned motion safety set is then embedded as a continuous safety cost into a constrained Markov decision process, and a PID-Lagrangian policy optimization scheme is employed to adaptively regulate the Lagrange multiplier for safety constraint enforcement. Simulation and real-vehicle experiments on low-adhesion obstacle-avoidance scenarios demonstrate that the proposed method achieves higher goal-reaching rates, produces smoother avoidance maneuvers, and maintains larger unified safety margins than baseline methods.
comment: Preprint
Revisiting The PBH Test: Fast Uncontrollability Certificates via Krylov Methods
This letter revisits the classical PBH test through the lens of finite-horizon reachability. By casting state transfer as a minimum energy, primal optimization problem, we show that unreachable state-space maneuvers admit dual infeasibility certificates. These certificates are computable without forming the controllability matrix meaning that uncontrollability can be efficiently certified. We prove that any such certificate is a linear combination of uncontrollable generalized eigenvectors, thereby providing a spectral interpretation without a global eigendecomposition. We also devise algorithms based on Krylov sub-space methods that extract some of the uncontrollable PBH modes from a certificate and demonstrate favorable scaling on large dynamic networks with thousands of nodes.
Differentially Private Consensus for Time-Delay Multi-agent Systems
This paper is concerned with the differentially private consensus problem for discrete-time multi-agent systems with communication delays. The purpose of the paper is to achieve differentially private consensus for such systems while protecting the entire delayed initial histories of all agents. A novel adjacency relation for delayed histories is introduced, and a Laplace-noise-based privacy mechanism is developed, where the noise variance is allowed to vary with time and even increase. By using the difference resolvent function method, decay estimates for the fundamental solutions of the delayed difference equations are derived. Based on these estimates and a backstepping technique, mean square weak consensus, mean square strong consensus, and almost sure strong consensus are established. The estimates for the fundamental solutions are also used to derive an explicit sensitivity bound. Furthermore, a constructive parameter design is provided to achieve a prescribed infinite-horizon $ε^\star$-differential privacy level. Numerical simulations illustrate the theoretical results.
comment: 19 pages, 4 figures
Optimal Ground-to-Air Interception with Time-Varying Acceleration Bounds
This paper proposes novel optimal-control-based guidance laws for ground-to-air missiles with time-varying acceleration bounds. In such engagements, as the missile climbs in altitude, its acceleration bound decreases, which may lead to acceleration saturation and significant miss distances if not explicitly accounted for. The proposed guidance laws incorporate hard acceleration command constraints directly into a linear-quadratic optimal-control framework, in contrast to conventional unbounded or softly constrained approaches. Analytically based guidance laws are developed for linear zero-order and first-order strictly proper missile dynamics with arbitrary-order linear target dynamics. Unlike the constant hard-bound case with minimum-phase missile dynamics, time-varying acceleration command bounds permit an initial unsaturated interval in which the proposed guidance laws can anticipate future saturation and reshape the acceleration profile accordingly. This enables earlier maneuvers when the missile possesses greater low-altitude maneuverability, fundamentally altering the structure of the optimal solution. The proposed approach is evaluated in nonlinear simulations and compared with equivalent unbounded and softly constrained optimal guidance laws. The results demonstrate substantially improved interception performance under saturation, reduced tuning requirements compared to softly constrained guidance laws, and enhanced capability in challenging engagement scenarios.
comment: This work has been submitted for journal publication. 38 Pages, 10 figures
Adaptive Deep Koopman Operator for Vehicle Dynamics Modeling: A Physics-Informed and Tire-Force-Driven Approach
Accurate and adaptive modeling of vehicle dynamics is paramount for the safety of autonomous driving systems, particularly under extreme maneuvers and time-varying parameters. While Deep Koopman operator theory offers a promising global linearization framework, its online application faces a theoretical bottleneck: the high-dimensional lifted state space inherently induces a rank-deficient problem, rendering traditional recursive least squares based updates numerically unstable. To address this, we propose a novel tire-force-driven modeling framework with guaranteed online stability. First, an offline Deep Koopman model is constructed by embedding 7DOF dynamic equilibrium constraints into the learning objective, ensuring the structural fidelity and physical interpretability of the lifted manifold. Second, we theoretically reformulate the operator update in the rank-deficient space as a minimum-norm solution problem. A Physics-Informed Variable Step-Size Normalized Least Mean Squares (PI-VSS-NLMS) algorithm is proposed, which leverages the projection property of NLMS to act as a stable pseudo-inverse solver while incorporating an anchoring mechanism to suppress parameter drift. Extensive simulations on CarSim and Hardware-in-the-Loop validation on dSPACE MicroAutobox III confirm the superiority of the proposed algorithm. It achieves robust prediction accuracy under unseen excitations while guaranteeing real-time feasibility with an average execution time of 0.421 ms, thus bridging the gap between theoretical models and practical deployment.
REGRID-QAOA: A Resource-Efficient Graph-Reduced Hybrid QAOA Framework for Physics-Constrained Power System Islanding
Quantum computing has rapidly emerged as a powerful paradigm for tackling computationally demanding problems. In particular, quantum optimization shows strong promise for hard combinatorial problems in power systems, where increasing distributed energy penetration heightens the need for intentional islanding to maintain grid reliability and resilience. However, power system islanding is an NP-hard combinatorial optimization problem that becomes computationally prohibitive for classical solvers as network size grows, motivating the use of quantum computing as a promising alternative pipeline. This study develops a resource-efficient hybrid QAOA islanding framework that brings physics-constrained power-system partitioning into the quantum optimization workflow. The framework combines coherency-informed graph reduction, physics-aware constraint modeling, and structured post-processing to efficiently convert shallow-circuit QAOA samples into high-quality feasible islanding decisions without deep circuits or large shot budgets. The proposed framework is validated on the standard IEEE benchmark systems (9-, 14-, 24-, 30-, 39-, and 57-bus), demonstrating that the hybrid workflow achieves Gurobi-optimal solution quality with a clear quantum resource advantage over vanilla QAOA, while the resulting islanding solutions satisfy all physical feasibility requirements after network separation. This study establishes QAOA-based islanding as a viable quantum approach for critical infrastructure, with structured post-processing as the key enabler of quantum resource efficiency.
BT-MTD: Bus Traversal-based Moving Target Defense for Smart Grid
Moving Target Defense (MTD) is a proactive security strategy designed to enhance cyber-resilience by dynamically altering system parameters, thereby preventing adversaries from acquiring the critical information needed to execute stealth attacks. In this paper, we consider the case in which the operator modifies the admittance of branches to enable MTD, and focus on the problem of effectively protecting the system with fewer number of branch admittance modifications and shorter computational time. Specifically, we identify the ineffectual branches whose admittance modification do not contribute to the improvement of MTD effectiveness via theoretical analysis. Building on these insights, we propose the Bus Traversal-based MTD (BT-MTD), which is a bus-oriented algorithm that traverses over the buses of the network according to analytically derived guidelines. The performance of the BT-MTD is evaluated and compared with four existing strategies on standard IEEE test systems, demonstrating its robustness and superior performance in effectiveness, efficiency, and computational cost. The code of BT-MTD is available at: https://github.com/YJY101/BT-MTD.
Mahalanobis-Guided Latent OOD Detection for Hybrid ES-DRL Control in Time-Varying Systems
In this paper, we study Mahalanobis-guided latent out-of-distribution (OOD) detection for test-time RL controller switching in nonlinear time-varying systems. RL controllers can quickly control high-dimensional systems within the training distribution, but their performance can degrade when time-varying dynamics produce unseen observations. We consider a combined ES--DRL controller, where RL provides fast in-distribution actions and bounded extremum seeking (ES) provides robust model-independent control under OOD operation. The key challenge is deciding when to switch. We train a variational autoencoder (VAE) on in-distribution beam-profile observations and use Mahalanobis distance in the VAE latent space to detect OOD beam profiles at test time. This OOD decision sets a binary switch that selects either the RL controller or the ES controller. We evaluate the approach in safety-critical particle accelerator control. In this setting, spatial magnet motion creates OOD beam profiles that were not seen during RL training. Visualization of the VAE latent space shows that the proposed method identifies this OOD scenario and provides an interpretable signal for switching between RL and ES in the combined controller.
Intelligent Sailing Model for Open Sea Navigation
Autonomous vessels potentially enhance safety and reliability of seaborne trade. To facilitate the development of autonomous vessels, simulations are required to model realistic interactions with other vessels. However, modeling realistic interactive maritime traffic is challenging due to the unstructured environment, coarsely specified traffic rules, and largely varying vessel types. Currently, there is no standard for simulating interactive maritime environments in order to rigorously benchmark autonomous vessel algorithms. In this paper, we introduce the first intelligent sailing model (ISM), which simulates rule-compliant vessels for navigation on the open sea. An ISM vessel reacts to other traffic participants according to maritime traffic rules while at the same time solving a motion planning task characterized by waypoints. In particular, the ISM monitors the applicable rules, generates rule-compliant waypoints accordingly, and utilizes a model predictive control for tracking the waypoints. We evaluate the ISM in two environments: interactive traffic with only ISM vessels and mixed traffic where some vessel trajectories are from recorded real-world maritime traffic data or handcrafted for criticality. Our results show that simulations with many ISM vessels of different vessel types are rule-compliant and scalable. We tested 4,049 critical traffic scenarios. For interactive traffic with ISM vessels, no collisions occurred while goal-reaching rates of about 97 percent were achieved.
Quantization Robustness of Monotone Operator Equilibrium Networks
Monotone operator equilibrium networks are implicit-layer models whose output is the unique equilibrium of a monotone operator, guaranteeing existence, uniqueness, and convergence. When deployed on low-precision hardware, weights are quantized, potentially destroying these guarantees. We analyze weight quantization as a spectral perturbation of the underlying monotone inclusion. Convergence of the quantized solver is guaranteed whenever the spectral-norm weight perturbation is smaller than the monotonicity margin; the displacement between quantized and full-precision equilibria is bounded in terms of the perturbation size and margin; and a condition number characterizing the ratio of the operator norm to the margin links quantization precision to forward error. MNIST experiments confirm a phase transition at the predicted threshold: three- and four-bit post-training quantization diverge, while five-bit and above converge. The backward-pass guarantee enables quantization-aware training, which recovers provable convergence at four bits.
comment: 6 pages, 4 figures. Accepted for publication in IEEE Control Systems Letters (L-CSS)
Trajectory-Level Redirection Attacks on Vision-Language-Action Models
Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/
Analysis of a Distributed Optimization-Based Control Architecture for Inverter-Interfaced Virtual Power Plants
We develop a large-signal stability analysis for a sampled-data, optimization-based secondary controller for inverter-interfaced distributed energy resources in virtual power plants.
comment: 10 pages, no figures
Robotics
Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control
Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.
comment: Project page: https://instruct-particulate.github.io/
EgoGuide: Egocentric Guidance for Efficient Robot-Free Demonstration Collection and Learning
Robot learning from real-world demonstrations is currently constrained by data scaling. Universal Manipulation Interface (UMI) provides an efficient robot-free data collection interface, yet current UMI-style pipelines often collect redundant demonstrations and lack global scene context. To improve data efficiency, we present EgoGuide, a collection interface that records synchronized wrist and head/egocentric observations and couples them with online visual-geometric data quality guidance. We also introduce a Gated Egocentric Residual Policy for robust learning from a viewpoint-varying egocentric camera, allowing head/egocentric context to correct ambiguous local observations while preserving stable wrist-view control. Real-world experiments show that EgoGuide reduces the required number of data episodes and improves data efficiency. The residual policy further improves robustness under visual occlusion. Project Page: https://silicx.github.io/EgoGuide
Whole-Body Impedance Model Predictive Control for Safe Physical Human--Robot Interaction on Floating-Base Platforms
Floating-base robots must balance under rigid contact constraints while interacting safely with humans. Existing whole-body control~(WBC) frameworks allocate the full joint space to locomotion or rely on fixed-gain impedance feedback that accumulates steady-state error under sustained physical human--robot interaction~(pHRI) forces. This paper extends the authors' fixed-base two-layer Impedance MPC to floating-base platforms through a three-level architecture: a centroidal MPC plans contact forces over a 500\,ms horizon; a priority-driven WBC layer resolves balance into joint torques through contact-consistent null-space projection; and the residual null space is governed by a receding-horizon quadratic program~(QP) that predicts and rejects pHRI disturbances using a Kalman-augmented state. A contact-consistent feedback linearization reduces the arm end-effector plant to a double integrator with a \emph{constant} state matrix within each contact mode, enabling offline precomputation of the QP cost and ${\geq}1$\,kHz operation. A covariance-inflation protocol preserves the disturbance estimate across contact-mode switches, guaranteeing zero steady-state error under bounded constant pHRI loads, and an Impedance Equivalence Theorem shows the infinite-horizon limit recovers a classical task-space impedance law whose effective mass, damping, and stiffness adapt to posture and contact configuration. Simulations on a 17-DOF biped and the Unitree G1 humanoid validate the design.
Safe Reinforcement Learning of Autonomous Highway Driving: A Unified Framework for Safety and Efficiency
Deep reinforcement learning (DRL) offers a compelling route to decision-making for advanced autonomous vehicles (AVs), yet its trial-and-error nature makes it difficult to guarantee safety during training and to achieve both safety and efficiency at deployment. We propose a unified safe reinforcement learning (SRL) framework that integrates safe distance (SD), reward machines (RM), and mixture-of-experts (MoE), termed MoE-RM-SRL. For deployment, SD and RM jointly shape a rule-aware reward that encodes highway traffic regulations and stage-wise objectives, enabling safe and reliable behavior without sacrificing efficiency. For training, we introduce a sparsely gated MoE layer comprising up to 11 deep Q-networks (DQNs); an SD-based gating rule activates a minimal set of experts for lane-keeping and lane-changing, mitigating the instability, discontinuities, and impulsive transients commonly induced by switching between heterogeneous controllers (e.g., MPC/rule-based modules and learned policies). We implement the proposed architecture in CARLA and integrate it with a 6-DoF driver-in-the-loop virtual-reality (DiL-VR) platform. Experiments in stochastic two-lane traffic show that MoE-RM-SRL substantially improves safety and efficiency over state-of-the-art baselines, and the framework naturally extends to multi-lane driving as well as on-ramp merging and exiting scenarios.
comment: 20 pages, 5 figures, 7 tables. Preprint version
Impedance MPC with Disturbance Estimation for Dexterous Hand Control
Dexterous hands must simultaneously track precise finger trajectories and maintain safe, compliant contact -- objectives in tension for any fixed-gain controller. We present an actuator-agnostic Impedance Model Predictive Control (Impedance MPC) framework for dexterous fingers, instantiating the constant-$A_d$ offset-free architecture established for physical human-robot interaction (pHRI); its stability, recursive-feasibility, and input-to-state-stability guarantees are inherited by preserving the architectural assumptions. An algebraic feedforward reduces the tendon transmission -- hydraulic, cable, pneumatic, twisted-string, or series-elastic -- to a constant-coefficient double integrator, so the QP cost inverse is precomputed offline and a 10-step receding-horizon quadratic program runs at 500\,Hz while enforcing hard constraints on contact force (ISO/TS 15066), actuation limits, and jerk. An encoder-only augmented-Kalman disturbance state drives steady-state error to zero under any constant contact load. On a hydraulically actuated finger -- the worked example platform, adding pressure and cavitation constraints -- the 500\,Hz Kalman MPC attains 0.5\,mrad RMS, 0.1\,mrad steady-state, and 6.6\,mrad peak deflection under 1.5\,Nm contact: 183$\times$, 1500$\times$, and 23$\times$ better than classical impedance. The realized first-move stiffness (18$\to$323\,Nm/rad with update rate) is independently verified. The architecture scales to a 16-DOF LEAP Hand MuJoCo simulation, recovering from 2.5\,N grasp-load disturbances within 0.7\,s.
What Robots Do Matters More Than What They Look Like: Task Context Shapes Trust in Educational HRI
Socially assistive robots (SARs) are increasingly deployed in educational and information-sharing contexts, supported by advances in large language models that enable fluent real-time interaction. Despite the growing diversity of robot embodiments, it remains unclear whether a single robot appearance is appropriate across different interaction tasks or whether trust depends primarily on contextual factors. In this study, we examine how robot appearance and task type jointly influence trust in robots. Using a within-subjects video-based experiment (N = 81), participants evaluated three robots with distinct appearances while performing three educationally relevant tasks: teaching, procedural instruction, and personal-information discussion. Results from repeated-measures analyses show a strong main effect of task on trust, with participants reporting the highest trust during instructional guidance, moderate trust during teaching activities, and significantly lower trust when robots requested personal information. In contrast, robot appearance showed no significant main effect, and the interaction between appearance and task was marginal. These findings suggest that trust in human-robot interaction is shaped more strongly by task context than by physical embodiment alone. By focusing on future educators as end users, this work contributes empirical evidence toward task-aware robot deployment in educational environments and highlights the importance of aligning robot roles and behaviors with interaction goals rather than relying solely on anthropomorphic design.
comment: Accepted in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026), Kitakyushu, Fukuoka, Japan
Sensitivity Shaping for Latent Modeling
Generative dynamics models enable planning in challenging robotic systems, but safe deployment requires reliably detecting policy-induced out-of-distribution (OOD) transitions. Existing methods typically treat the learned dynamics as fixed and attach post hoc support surrogates. We show that these surrogates can fail when the dynamics are locally insensitive to critical action choices: unsupported control actions may produce latent predictions that resemble demonstrated transitions, suppressing OOD signals despite large true predictive errors. To address this, we introduce support-conditioned control-sensitivity regularization, which promotes sensitive local response to control input changes in learned dynamics in high-support training regions. This preserves control-induced variation while limiting unstable extrapolation due to weak empirical support. Experiments in vision-based obstacle avoidance, manipulation, and real-robot navigation show improved OOD detection and safer closed-loop planning.
ORCA: A Platform for Open-Source Dexterity Research
Robotics manipulation research increasingly focuses on two-finger parallel grippers for their effectiveness, affordability, and ease of teleoperation. Grippers are nonetheless limited by their form factor, often requiring bimanual setups even for simple reorientation tasks. Anthropomorphic hands are a more natural platform for dexterous robot learning -- closer to the human hand, and capable of learning from human video -- yet they remain hard to use in learning research: even where open and accessible hand hardware exists, the software for control, simulation, teleoperation, and retargeting is scattered in one-off code bases, and largely disconnected from the robot-learning ecosystem. In this work, we introduce the \orca~learning stack, an open-source research stack for dexterity as a first-class robot learning domain. Our \orca~stack unifies low-level control, simulation, teleoperation from a range of consumer platforms, and hand retargeting, behind a single interface, and integrates natively with popular robot-learning frameworks such as \lerobot, so dexterous hand researchers can leverage the same data, training, and evaluation pipelines used for non-dexterous robot learning. We demonstrate a complete end-to-end workflow, collecting expert demonstrations of an in-hand reorientation task by teleoperation with a consumer-grade VR headset, training an autonomous policy with \lerobot, and evaluating the learned policy in a fully reproducible and observable setup. We open-source the entire stack as a shared, reproducible foundation for dexterous-manipulation research.
comment: 15 pages
TRACE: Trajectory-Routed Causal Memory for Delayed-Evidence Visuomotor Imitation
Robots under autonomous operation may require decisions based on evidence that is no longer visible. We study \emph{delayed-evidence} tasks, where an early cue disappears before a later decision point, so visually similar observations can require different actions. In these settings, the current observation is not a sufficient state for control. We introduce TRAjectory-routed Causal Evidence (TRACE), a memory framework for visuomotor imitation policies. TRACE stores task-relevant visual and robot-state evidence, such as object identity, target choice, or route-dependent state, in a fixed-size latent memory that remains bounded over long episodes. Instead of indexing memory by raw time or manually provided task labels, TRACE uses \emph{path signatures}: compact, order-sensitive features of the executed robot-state trajectory. These signatures do not store the visual cue itself; rather, they provide trajectory-conditioned keys for writing and retrieving the evidence stored when the cue was visible. When the robot later reaches an ambiguous observation, the policy conditions on TRACE memory to recover the missing context and choose the correct branch. TRACE attaches through lightweight adapters to policies, without changing the policy backbone, action head, or imitation objective. Across real-world long-horizon manipulation tasks with visually ambiguous branch points, TRACE improves branch selection and task success over alternative baselines, including short-history and recurrent memory. Project page: https://jeong-zju.github.io/trace
Provably Safe, Yet Scalable Reinforcement Learning
Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provide formal safety guarantees for the learned policy. In contrast, methods with strict guarantees typically rely on explicit certificate functions, whose construction requires the direct synthesis and verification of control-invariant sets, a process that scales poorly with state dimension and often yields overly conservative behavior. In this paper, we present the Provably Safe, yet Scalable RL (PS2-RL) framework, a novel two-phase architecture for learning provably safe policies in a scalable manner, designed to overcome the key bottlenecks of prior methods. Rather than explicitly computing invariant sets, PS2-RL leverages a learned backup policy to forward-integrate the system dynamics, generating an implicit control-invariant set online. In the first phase, the backup policy is trained with our proposed safe-arrival value function, which characterizes the optimal backup policy for invariant-set construction. In the second phase, an RL policy is trained end-to-end through a differentiable projection layer that strictly enforces the safety guarantees induced by the learned backup policy. By maximizing the volume of the implicit control-invariant set in the first phase, the resulting PS2 policy from the second phase is performant and scalable, while maintaining provable safety. Crucially, PS2-RL imposes no restrictions on the underlying RL algorithm and can be plugged into any existing training pipeline. We establish theoretical guarantees for the proposed framework and evaluate it on robotic control tasks with state dimensions up to 10, a regime in which prior provably safe RL methods struggle or become impractical.
Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera
Recent visual imitation learning systems have widely adopted multi-camera setups with wrist-mounted cameras as the de facto standard. However, manipulation from a single global view remains challenging, as the policy should capture fine-grained interaction details and identify task-relevant regions without local wrist views. To address this challenge, we present Spatially Conditioned Diffusion Policy (SCDP), a diffusion-based visuomotor policy that achieves precise and robust manipulation in a single-camera setting. Our key idea is that end-effector trajectories can serve as visual attention anchors that reflect task-relevant regions. Building on this idea, SCDP consists of two key components: (i) a visual encoder that produces multi-scale feature maps to capture both broader context and fine-grained visual features, and (ii) a spatial conditioning module that samples point-wise features along intermediate end-effector trajectories in the diffusion loop. Extensive simulation experiments show that SCDP consistently outperforms strong single-view baselines and achieves performance comparable to multi-camera baselines. Real-world experiments further demonstrate precise manipulation and robustness to visual distractors, highlighting the potential of single-camera imitation learning.
comment: 15 pages
AERMANI-PLACE: Language Guided Object Placement with Aerial Manipulators
Object placement is a fundamental component of aerial manipulation tasks, yet existing systems typically require the desired placement position to be specified explicitly in metric coordinates. Such interfaces are not intuitive and require users to reason about coordinate frames and scene geometry, making them difficult to use in practical deployments. In contrast, humans often communicate spatial goals through a combination of language and pointing gestures. Inspired by this observation, we present AERMANI-PLACE, a framework for language-guided object placement with aerial manipulators. Given a scene image and a natural language instruction, an image editing model generates a modified version of the scene containing a visual marker that indicates where the object should be placed. This marker is then grounded into the physical environment using depth observations to recover a metric place point, after which a placement trajectory is generated and executed by the aerial manipulator. We evaluate the proposed approach on a test set of 100 language-guided placement tasks and demonstrate successful execution on a real aerial manipulation platform. Experimental results show that the proposed method reliably infers placement locations from language instructions with an average success rate of 87\% on the test-set and transfers effectively to real-world aerial manipulation with an average success rate of 72\%. Video: https://youtu.be/SgwwgLBsv0g
CADET: Physics-Grounded Causal Auditing and Training-Free Deconfounding of End-to-End Driving Planners
End-to-end (E2E) autonomous-driving planners trained by imitation are prone to statistical shortcuts: they associate scene elements that merely co-occur with expert actions (a roadside object, a building facade) with driving decisions, rather than the variables that causally determine them. Such causal confusion silently compromises reliability in long-tail scenarios, and it is difficult to detect, because prevailing open-loop metrics (L2 displacement and collision rate) are dominated by ego status and do not indicate whether a planner depends on spurious cues. Existing remedies based on causal-intervention training require retraining large models and cannot audit a planner that is already deployed. We present CADET, a training-free framework that audits, benchmarks, and repairs spurious reliance in pretrained E2E planners without any parameter update.
comment: 8pages 4figures
Kine2Go: Kinematic dataset for the Unitree Go2 robot with diverse gaits and motions
The recent popularity of robotics, combined with the steadily decreasing cost of robotic hardware, has lowered the entry barrier to robotics research and enabled rapid advancements in the field. One of the primary examples is the Unitree Go2 quadruped robot, which is often used by researchers in the areas of locomotion, navigation, control, and others. Many researchers use the Go2 robot in combination with techniques like imitation learning, reinforcement learning, and behavioral cloning to allow machine learning systems to take full control of the robot. At the same time, many of those techniques require demonstration data consisting of the robot's kinematics information and actions applied to the motors. Obtaining such data is difficult, requires building complex pipelines, and can take significant time. To aid in those kinds of efforts, we present Kine2Go - a dataset with 800 diverse gait kinematics trajectory motion data for the Unitree Go2 robot, derived from 40 distinct policies. Our pipeline accepts data from various quadruped morphologies and translates them to a Go2-compatible format. Then we use Reinforcement Learning to train policies following a given motion, and finally we gather data from those policies, which grants robust, perturbed kinematic data with corresponding motor-level actions.
comment: 9 pages, 6 figures
ForestBack: Breadcrumb-Based Pedestrian Dead Reckoning for Infrastructure-Free Return Navigation
Reliable return navigation remains an important challenge in GPS-denied environments where external positioning infrastructure may be unavailable or unreliable. This paper presents ForestBack, an infrastructure-free pedestrian return navigation framework based on breadcrumb-based pedestrian dead reckoning (PDR). The system records a user's walking route as a sequence of reversible breadcrumb nodes and generates reverse-path guidance without requiring GPS, Wi-Fi, Bluetooth beacons, or pre-installed infrastructure. ForestBack integrates acceleration-based step detection, adaptive step-length estimation, magnetometer-assisted heading estimation, barometric-altitude correction, and bidirectional breadcrumb path reconstruction. The system was evaluated using an indoor obstacle-avoidance route with five checkpoints, where the user navigated around a central obstacle. A dataset of 36 walking trials and 42,474 time-series samples was used for evaluation, including IMU signals, magnetometer readings, barometric variables, turn-event labels, ground-truth trajectories, baseline PDR outputs, proposed ForestBack outputs, and power-related measurements. Experimental results show that ForestBack reduced the mean RMSE from 1.129 m to 0.965 m compared with traditional PDR, corresponding to a 15.76% improvement. The mean final-position error was reduced from 1.781 m to 1.388 m, while turn-event detection consistency reached approximately 99.90%. These results indicate that ForestBack improves trajectory reconstruction and route-preserving return guidance in obstacle-avoidance scenarios. The released dataset and analysis notebook support reproducibility and future benchmarking of infrastructure-free PDR-based return navigation systems.
comment: 9 pages, 6 figures, 1 table, and 19 equations
Causal Object-Centric Models for Planning with Monte Carlo Tree Search
We introduce COMET (Causal Object-centric Model for Efficient Tree search), a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. COMET pairs a frozen unsupervised object-centric encoder with a transformer-based world model, in which actions are bound to objects through a novel action-slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object-causal attention, modulating token interactions by learned per-slot relevance scores so that decision-making concentrates on task-relevant entities. COMET adds an explicit object-level inductive bias to MuZero-style latent planning. Across eight visually and dynamically diverse tasks from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object-centric and monolithic baselines.
Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack
In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack.
Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models
Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more computation and fresher feedback, while easier states can often be handled with fewer inference steps and longer open-loop execution. We propose Elastic Queries Reinforcement Learning (EQRL), a framework that makes each VLA policy query elastic. A lightweight latent-schedule adaptor jointly selects the latent input, denoising budget, and action chunk length, without fine-tuning the underlying VLA model. To make scheduling difficulty-aware, EQRL trains a critic over the joint latent-schedule action and derives a state difficulty signal from critic ensemble disagreement. This signal guides compute toward difficult states, while a learned residual allows task-driven correction. We formulate variable chunk execution as query-level macro-action RL with chunk-dependent discounting and an amortized number-of-function-evaluations (NFE) budget. Across simulation and real-robot manipulation, EQRL reduces amortized inference cost while preserving or improving task success.
Robust Fall Recovery for Armless Bipedal-Wheeled Robots Via Force-Guided Learning
Fall recovery is critical for autonomous legged locomotion. Existing methods have demonstrated that some legged robots, such as humanoids and quadrupeds, are capable of fall recovery from diverse postures by utilizing arms or coordinating multi-legs to generate support forces. Without arms or other legs to provide supportive assistance, a bipedal-wheeled robot must rely solely on the actuation of its legs, making recovery particularly difficult. To address this, we introduce FTSR (Force-guided Teacher-student framework with Stage-wise Rewards). The force-guided method constructs an external auxiliary force during simulation training that correlates directly with the robot's real-time height, explicitly formulating this force as an optimizable constraint. Through constrained reinforcement learning, the policy is guided toward reducing force dependency gradually and increasing the body height, developing internal recovery strategies despite having no arms for support. Height-progressive stage-Wise rewards progressively structure posture stabilization during recovery and transition to sustained locomotion, integrated with teacher-student architecture distilling privileged knowledge of force effects and recovery dynamics. After simulation training, the policy is deployed on a physical armless bipedal-wheeled robot and extensively evaluated. Experiments confirm robust and reliable fall recovery under diverse challenging conditions, demonstrating strong environmental adaptability and motion robustness, while maintaining full post-recovery motion capability. The framework also generalizes effectively to a high-DOF humanoid, confirming its practical generalizability. The project page is available at https://2350575870.github.io/force-guided.github.io/
comment: 8 pages, 6 figures, accepted by IEEE Robotics and Automation Letters (RA-L)
FloVerse: Floor Plan-Guided Multi-Modal Navigation CVPR 2026
Floor plans encapsulate compact spatial priors, enabling agents to navigate unseen scenes more efficiently. While prior work has explored floor plan-guided navigation, it has focused mainly on PointNav and a limited set of environments. To bridge this gap, we introduce FloVerse, a new task for floor plan-guided embodied navigation that unifies PointNav, ObjectNav, and ImageNav. To support FloVerse, we assemble FloVerse-1.6K, a large-scale dataset of 1.6K scenes from HM3D and Gibson 4+, paired with corresponding floor plans, comprising 240K expert trajectories and 12M RGBD frames. We further propose ThreeDiff, a two-stage imitation learning policy comprising a planner, a diffusion-based multimodal goal-reasoning module trained via masked-modality modeling, and a refiner, a depth-based trajectory-refinement module for safe execution. Extensive experiments demonstrate that (1) floor-plan priors improve navigation performance across all goal modalities, and (2) ThreeDiff implicitly captures spatial information from floor plans. These results underscore the effectiveness of spatial priors and validate our proposed unified approach for floor plan-guided embodied navigation.
comment: Accepted at CVPR 2026
ReactVLA: Fast and Lightweight Reactive Robot Manipulation via Improved Mean Flow Action Generation
Diffusion-based Vision-Language-Action (VLA) policies have demonstrated strong capability in modeling expressive and multimodal action distributions. However, their reliance on iterative sampling introduces substantial inference latency, which limits their applicability to reactive closed-loop robot manipulation. To address this limitation, we propose \texttt{ReactVLA}, a lightweight and low-latency VLA framework for real-time robotic manipulation. \texttt{ReactVLA} combines two complementary designs: (1) an improved Mean Flow (iMF) action generator that reduces expensive multi-step diffusion sampling to one-to-few-step action generation, and (2) Attention Residuals (AttnRes), a dynamic depth-wise feature routing mechanism that replaces uniform residual accumulation to better preserve task-relevant multimodal representations. We evaluate \texttt{ReactVLA} on large-scale simulation benchmarks, including LIBERO and RoboIMI, as well as real-world robotic manipulation tasks. Experimental results show that \texttt{ReactVLA} consistently outperforms similarly sized VLA baselines, including SmolVLA and $π_0$. On challenging precision manipulation tasks, \texttt{ReactVLA} achieves up to a 1.65$\times$ improvement in task performance while providing more than a 4$\times$ increase in inference speed compared with leading VLA models. Finally, it reduces real-world policy latency to below 38.6 ms, enabling fast reactive control on physical robot platforms. Please check out our project website at: https://game-loader.github.io/ReactVLA/.
Optimality-Preserving Decomposition for Scalable QAOA in Natural-Language-Guided Multi-Drone Assignment
As multi-drone fleets scale, zone assignment rapidly evolves into an intractable NP-hard combinatorial problem that overwhelms classical exhaustive search. While quantum optimization promises to shatter these classical bottlenecks, mapping complex spatial tasks from human intent to restricted quantum hardware remains a severe challenge. To bridge this gap, we present an end-to-end framework integrating a fine-tuned Large Language Model (LLM) front-end with a highly scalable, domain-specific quantum-classical backend. The front-end utilizes Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to translate free-form natural language instructions into structurally robust Quadratic Unconstrained Binary Optimization (QUBO) constraints without false negatives. To overcome the strict qubit limits of near-term quantum devices, our framework features a novel constraint-preserving graph partitioner and a compressed separator-based dynamic programming (DP) merge. By structurally encoding constraints via W-state initialization and XY-mixers in Conditional Value-at-Risk Quantum Approximate Optimization (CVaR-QAOA), the pipeline stays highly compact. Empirical results demonstrate that this architecture circumvents classical scaling walls, recovering the global optimum on 100% of idealized oracle cases and 96.3% under real QAOA sampling, enabling natural-language-guided task allocation at previously intractable scales.
comment: 10 pages, 2 figures, 3 tables, preprint
SyLink Hand: A Synergy-Inspired Linkage-Driven Anthropomorphic Hand for Human-Like Dexterity
Designing anthropomorphic robotic hands that balance functional dexterity with mechanical simplicity remains a significant challenge. Inspired by human hand synergies, this paper presents the SyLink Hand, an anthropomorphic dexterous hand that integrates biomechanical synergy principles with linkage-driven transmission mechanisms to achieve a high degree of anthropomorphism in appearance, kinematics, and functionality within a compact and cost-effective architecture. Biomechanical analysis of natural hand motions using motion capture gloves reveals strong kinematic correlations among hand joints, providing the basis for a simplified yet functional degree-of-freedom (DOF) configuration. Guided by these synergistic characteristics, optimized linkage mechanisms are employed to coordinate multiple joint motions and reproduce natural finger trajectories. A novel spherical four-bar linkage is further proposed to achieve decoupled flexion/extension (Flex/Ext) and abduction/adduction (Abd/Add) at the metacarpophalangeal joint within a compact form factor. The resulting prototype integrates 19 joints driven by 11 actuators, with a total mass of 520g and a manufacturing cost of approximately USD 400. Experimental evaluations demonstrate its human-like kinematic performance, high load-bearing capability, and versatile grasping and manipulation skills. These results validate that the synergy-inspired, linkage-based design effectively balances anthropomorphism, mechanical simplicity, and functional versatility, highlighting its potential for practical deployment in dexterity-demanding robotic applications.
When and How Severely: Scenario-Specific Safety Envelopes for Driving VLAs
Safety certification of Vision-Language-Action (VLA) driving planners under ISO 21448 (SOTIF) rests on an Operational Design Domain (ODD) specification that answers two complementary questions: when does the planner start to fail, and how severely does it fail once it does? We evaluate Alpamayo R1, a 10B-parameter open-weight driving VLA, on 15,968 (clip, attack) pairs. We find a conservative-aggregate gap: an aggregate safe threshold of $σ\leq 50$ under a 15% average displacement error (ADE) budget masks well-sampled scenarios that tolerate the top of the tested grid ($σ= 70$). A Gaussian Mixture Model (GMM) on the changed-explanation subset identifies six discrete severity bands (BIC-optimal $k{=}6$), so two perturbation conditions with the same mean error can differ materially in their share of high-severity (C4/C5) failures. Joining the two analyses on the same corpus surfaces a finding neither yields in isolation: the scenarios with the loosest noise thresholds are not those with the lowest high-severity rate: STOP_SIGNAL concentrates roughly $4\times$ the C4/C5 share of LANE_KEEPING despite tolerating a larger $σ$. A deployable SOTIF ODD specification for driving VLAs therefore requires a two-dimensional safety envelope, not a single aggregate value per hazard.
BIM-Loc: BIM-Integrated Discrepancy-Aware LiDAR-based Indoor Localization
Accurate and robust localization is a fundamental requirement for service and inspection robots, particularly in feature-sparse indoor environments where traditional systems struggle due to a lack of distinct landmarks. While prior maps can enhance robustness, precise and compact maps capturing real-world details are often unavailable for new or frequently changing environments. This paper presents BIM-Loc, a novel discrepancy-aware LiDAR-based localization method that directly integrates Building Information Models (BIM) from the design phase. BIM-Loc simultaneously estimates trajectories aligned with the BIM coordinate system and identifies discrepancies between real-world observations and the as-designed BIM in an online fashion. Our core contributions include: (1) a novel multi-hit ray casting strategy for efficient BIM-point data association and projection of 3D observations into 2D texture space; (2) a pose graph optimization framework with BIM-integrated factors that enforces consistency among odometry, sequential scans, and BIM structures; and (3) a hierarchical Bayesian inference module that incrementally updates a continuous 2D surface representation for discrepancy detection, propagating updates from the pixel to the structure level. Extensive evaluations in both simulation and real-world applications demonstrate that BIM-Loc significantly outperforms state-of-the-art map-based methods in localization accuracy and robustness.
comment: 24 pages, 21 figures, accepted by International Journal of Robotics Research (IJRR), to be published
Selective Agentic Recovery for UAV Autonomy with a Persistent Mission Runtime
Agentic AI can support unmanned aerial vehicle (UAV) autonomy by providing high-level recovery reasoning when local waypoint- or setpoint-based execution encounters blocked passages, repeated no-progress behavior, or mission-level ambiguity. On physical UAVs, however, remote reasoning is most useful when it is invoked selectively, since each call introduces latency, resource cost, backend uncertainty, and a need to validate the returned decision. This paper presents Persistent Mission Runtime (PMR), a UAV recovery framework that keeps the mission loop and safety-critical execution local while using an external agentic reasoner only as an on-demand recovery module. The reasoner selects from predefined recovery skills, and each returned decision is parsed, verified, safety-filtered, and mapped to local executor actions before it can affect flight. PMR introduces learned Cognitive Value of Invocation (learned-CVI), a compact admission gate that estimates when remote agentic reasoning is likely to improve near-term mission progress enough to justify its operational cost. Across a fixed 400-run Gazebo/PX4 benchmark with eight scenarios, learned-CVI raises hard/ambiguous-regime success from 5.0% under local-only autonomy to 95.0%, outperforms one-shot and periodic reasoning baselines by 20.0 and 32.5 percentage points, and reduces remote-agent calls by 16.7% and logged tokens by 29.2% relative to a manually tuned rule-based invocation baseline.
comment: 17 pages, 2 figures. Preprint
Universal Manipulation Exoskeleton: Learning Compliant Whole-body Policies with Real-time Torque Feedback
For robots to work safely in household environments, they need to be compliant and react to torque and force feedback during contact. However, the majority of existing data collection pipelines still lack the ability to capture force and torque data for learning active compliant policies. In this paper, we present Universal Manipulation Exoskeleton (UME), an upper-limb exoskeleton that provides real-time haptic torque feedback while recording whole-arm configurations and joint torque signals for teleoperation. With transparent torque feedback, human operators can even unsheathe kinematically constrained objects while blindfolded. UME is low-cost, lightweight, and portable. Equipped with an embedded IMU, it enables teleoperation for mobile manipulation. With our proposed universal retargeting algorithm, UME can teleoperate a range of robots, including the 7DoF OpenArm, 7DoF Franka, and 6DoF X-ARM. We demonstrate that this combination of capabilities enables learning bimanual, whole-body, and active compliant policies that operate effectively in highly constrained spaces. The learned robust autonomous policies achieve high success rates across a variety of tasks, including long-horizon mobile manipulation, force-mediated box flipping, visually occluded box pushing, and space-constrained tabletop manipulation. Videos, code, and additional information can be found at https://ume-exo.github.io.
Short-Horizon Position Accuracy of Single-Track Models: Implications for Motion Planning of Autonomous Vehicles
Accurate and computationally efficient vehicle models are essential for motion planning of autonomous vehicles, where positional accuracy directly affects trajectory feasibility and safety. However, the positional accuracy has not been systematically evaluated against real measurements. Therefore, this paper compares the short-horizon positional accuracy of three single-track vehicle models against vehicle measurements across various driving maneuvers. Model parameters are identified through dedicated experiments with the instrumented test vehicle. Rather than identifying a single best model, this work aims to provide insight into the trade-offs between model complexity, parameterization quality, and positional accuracy for informed model selection in Model Predictive Control applications.
comment: Submitted to The International Journal of Automotive Engineering, Official Journal of the Society of Automotive Engineers of Japan, Inc. (JSAE)
Robustness without Wrinkles: Parallel Simulation and Robust MPC for Certified Deformable Manipulation
We present CORD-SLS, a real-time control method for safe deformable object manipulation, with a focus on ropes and cloth. At its core is a GPU-parallel differentiable simulator with contact smoothing which enables efficient gradient-based planning through intermittent contact. To robustly satisfy constraints under model and sensing uncertainty, we develop a real-time, GPU-parallel output-feedback robust model predictive control (MPC) algorithm that plans with this simulator. We further show that the simulator accelerates model-based RL for training neural manipulation policies. To improve real-world robustness, we use conformal prediction to calibrate visual-feedback and perception-error bounds for MPC, producing reachable tubes that enable high-probability safe control. We evaluate CORD-SLS on high-dimensional, contact-rich rope and cloth manipulation tasks in simulation and hardware, including obstacle avoidance, routing, folding, and smoothing. Across settings, CORD-SLS achieves millisecond-speed planning, exceeding baselines in safety, speed, and task success.
GAIT: Legged Robot Proprioceptive State Estimation with Attention over Inertial-Leg Tokens
In this paper, we propose a method that applies Inertial-Leg (IL) tokenization to an attention-based network for proprioceptive state estimation in legged robots. Unlike existing learning-based state estimators that concatenate all sensor measurements into a single flat vector, the proposed architecture represents inertial measurements and leg-wise measurements as individual tokens and uses an attention mechanism to learn the relative importance of each measurement.This design allows the network to reweight each measurement according to the current contact condition, reflecting the fact that the reliability of forward kinematic measurements depends on whether the corresponding foot is in contact. Unlike conventional contact-aided estimators, however, the proposed method learns this behavior without relying on an explicit contact estimator or on explicit measurement updates based on a stationary contact assumption. To validate the proposed method, we conducted experiments on a Unitree Go1 robot, including debris terrain not modeled in simulation and gait patterns not seen during training. Experimental results show that the proposed method achieves better estimation performance than existing learning-based state estimators under unseen gait patterns and also improves performance over contact-aided model-based methods.
Encoder Winners Do Not Reliably Transfer Across VLA Backbone Scale: A Frozen-Backbone Grafting Diagnostic
Vision-language-action (VLA) policies typically inherit their vision encoder from upstream VLM releases, but it is unclear whether an encoder choice validated on a small VLA transfers to a larger backbone. We introduce a frozen-backbone grafting diagnostic: the vision tower of a released VLA is replaced by a candidate encoder under a fixed protocol (adaptive average pooling, LayerNorm, and a single trainable linear projector), with the language model and action expert frozen. Across four encoders, two LIBERO suites, two backbones (SmolVLA-450M and $π_{0.5}$-3.3B), and two-to-three seeds per cell (40 main grafting runs plus native, LoRA, pooling, and zero-/shuffled-image controls, all scored by offline action MSE), the small-backbone winner does not reliably select the large-backbone top tier: SigLIP is best on SmolVLA across both suites, while on $π_{0.5}$ DINOv2-small leads the spatial suite and the object suite is a seed-sensitive near-tie band; three of the four backbone-suite comparisons (and 11 of 12 seed-level cells) support backbone-dependent rankings. The grafting wrapper is itself non-neutral with opposite sign across backbones (+45-56% MSE on the SmolVLA native tower, -50-52% on $π_{0.5}$), so all conclusions are conditional on the fixed grafting protocol. We position frozen grafting as a cheap target-backbone diagnostic to run before committing to an encoder at scale, not as a closed-loop deployment claim.
comment: 23 pages, 5 figures, 8 tables
A Modular Dual-Arm Apple Harvesting Robot with Enhanced Field Performance
Robotic apple harvesting offers a promising solution to labor shortages in commercial orchards, but low throughput and poor performance in orchard environments hinder its commercial adoption. This paper presents a modular dual-arm apple harvesting robot that uses a vertically stacked arms to enable simultaneous operation in the upper and lower zones of a single tree, simplifying platform positioning from multi-tree lateral repositioning to single-tree stops. Compared to our prior horizontal dual-arm system, the platform integrates 5 advances: (1)a foundation-model-based perception pipeline combining Grounding-DINO and EfficientViT-SAM for robust fruit localization in unstructured outdoor environments; (2)7th-order jerk-bounded trajectory generation paired with a Control Barrier Function safety filter to achieve fast yet safe arm motions; (3)a linear sweep harvesting strategy with a 10cm approach buffer and rotational detachment that improves picking reliability; (4)a temporal-logic-based dual-arm coordination policy with vision-arm async scheduling that maximizes usage of a shared vacuum source; and (5)field validation in 2 commercial orchards covering different apple varieties and tree architectures during the 2025 harvest season. Across the 1738 arm cycles collected in these field trials, the system achieved an 80.0% per-attempt success rate and a mean per-arm cycle time of 7.53s. Fruit damage assessments confirmed that 91.2% of robotically harvested fruit retained the highest USDA grade (Extra Fancy), with bruise rates between 2.4% and 4.9%. With further improvements in the picking cycle time and handling of heavy foliage occlusions, this new modular robot design holds promise for commercial harvesting of apples.
Self-Improving VLA Policies: Selected Diffusion Noise for Spurious-Robust Action Smoothing
Diffusion-based Vision-Language-Action (VLA) policies enable strong generalization in robotic manipulation, but remain sensitive to spurious visual correlations and noisy action generation, leading to brittle behavior under perturbations. We introduce Selected Diffusion Noise (SDN), a simple, training-free test-time method that improves both robustness and success rate by leveraging the diffusion noise space as a controllable degree of freedom. SDN dynamically samples noise vectors that are maximally separated from a reference set to mitigate reliance on spurious cues, while selecting candidates that yield more coherent action trajectories. This dual objective encourages stable behavior even under object-masked observations and reduces action jitter without modifying model parameters. We evaluate SDN on two simulation benchmarks (Google Robot, Widow-X) and two real-world robotic datasets across multiple VLA policies, including pi_0, Groot-N1.5, and Groot-N1.6. SDN consistently improves success rates by +8% in simulation and +10% in real-world settings, while producing smoother and more stable actions. Our results highlight that diffusion noise selection can serve as an effective and general mechanism for enhancing VLA policies at test time.
The N2D Haptic Glove: A Multi-Finger Glove for 2D Directional Force Feedback for Contact Rich Manipulation
Humans rely on directional fingertip forces to probe and regulate contact during manipulation, yet most wearable haptic gloves render only vibration or single-axis force, leaving force direction ambiguous. Without directional cues, users must infer contact force from vision alone, often leading to over-pressing, inconsistent control, and reduced precision in robotic teleoperation. We present the N2D Haptic Glove, a multi-finger wearable device that renders planar flexion-extension fingertip forces using capstan-drive transmissions for high-transparency force feedback. Through benchtop validations and a user study involving haptic teleoperation of a robotic arm and hand, we demonstrate that compared to visual-only and single-axis haptic baselines, planar fingertip feedback significantly reduces contact force error during precise manipulation, improves trial-to-trial consistency, and enhances overall user experience in axial probing tasks. These findings establish the N2D Haptic Glove and directional finger-based haptics devices as a promising modality for contact-rich teleoperation, immersive virtual reality simulations, and robot learning from demonstrations. N2D Haptic Glove's hardware and software system will be fully open-sourced at \href{https://ucsdarclab.github.io/n2d-glove/}{this https URL}.
Development of a 3 in Sewer Pipe Inspection Robot with an Articulated Differential Mechanism using X-shaped Linkages
This paper proposes, an improved version of the 3 in sewer pipe inspection robot equipped with an emergency evacuation mechanism. The low traction force and poor stepover capability, which were challenges of the first version, have been improved by simply connecting the propulsion units. The coupled propulsion units feature a differential mechanism capable of posture changes via a single wire, enabling adaptation to pipe diameter variations. To traverse obstacles like pipe joints, a control method was devised that detects obstacle contact through current load on the drive wheel motors and slackens the wire. This method was verified through simulated pipe experiments. Load comparisons were made using current waveforms applied to the drive wheels. Our proposed control method significantly improved the step-over capability of the new articulated robots.
comment: The 23rd International Conference on Ubiquitous Robots (UR 2026), 15-18 July, Osaka Ibaraki Campus, Ritsumeikan University, Ibaraki, Osaka, Japan
Semidefinite Relaxations for Collision-Free Motion Planning
We study semidefinite relaxations for collision-free motion planning. We focus on a point robot moving from start to goal through spherical obstacles in $\mathbb{R}^n$, subject to path continuity constraints and squared derivative costs; a setting that is conceptually simple yet captures the hardness of collision-free motion planning. We formulate this problem exactly as a nonconvex problem over polynomial curves, and present a natural semidefinite relaxation. We contribute two key theoretical insights; to our knowledge this is the first theoretical analysis of semidefinite relaxations for collision-free motion planning. First, we show that solving the convex relaxation is equivalent to solving, to global optimality, a related motion planning problem in a potentially higher-dimensional space. This geometric interpretation yields necessary and sufficient conditions for tightness, and a clear intuition for when the relaxation is loose. Second, we show that the relaxation admits a symmetry reduction that makes it significantly smaller than one might expect, with positive semidefinite cone sizes that scale linearly with the polynomial degree and are independent of the ambient dimension. The resulting relaxation is 10 to 100 times faster than direct nonlinear programming transcriptions solved with SNOPT and IPOPT, exhibits significantly lower variance in solve times, and reliably finds a locally optimal path for the original problem. We demonstrate its effectiveness as a convex steering function in an RRT planner for minimum-snap quadrotor planning with $C^4$ continuous trajectories.
ReactSim-Bench: Benchmarking Reactive Behavior World Model Simulation in Autonomous Driving
Reactive capability is a key property of data-driven behavior world model simulators for autonomous driving simulation systems. With this capability, simulated world agents can respond feasibly to autonomous vehicle (AV) behaviors that differ from the log. However, existing behavior simulation benchmarks do not directly measure reactive capability. They often let the simulator jointly control the AV and surrounding agents and evaluate realism through log similarity or open-loop prediction metrics. In this work, we introduce ReactSim-Bench for evaluating the reactive capability of behavior world model simulation in autonomous driving. We decouple the control of agents and the AV, using AV behaviors that differ from the log and require agents to respond as independent AV inputs. To obtain these AV behaviors, we construct a pipeline that uses an AV planner model to generate candidate behaviors and filters the data using rules and manual verification. Collision metrics, map-based metrics, and kinematic feasibility metrics are used to evaluate the safety and rule compliance of reactive responses. We construct 2,636 test scenarios with three categories and conduct a systematic evaluation of state-of-the-art models across multiple architectures, including Transformer-based, diffusion-based, and next-token-prediction-based models. We further analyze how replan frequency affects performance and provide insights for future studies.
WAM4D: Fast 4D World Action Model via Spatial Register Tokens
World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation. To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens. Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.
comment: 15 pages, 7figures, 9tables
From Attacks to Curricula: Learnability-Guided Adversarial Training for Safe Autonomous Driving
Closed-loop adversarial training improves autonomous driving safety by exposing policies to rare safety-critical scenarios. Standard pipelines first generate adversarial scenarios and then sample them for policy optimization. However, most existing frameworks remain attack-oriented: collision-driven generators often synthesize unsolvable extreme situations, which can degrade learning, while heuristic samplers ignore the evolving capability of the driving policy, causing sample inefficiency and delayed convergence. We propose AlignADV, a learnability-guided closed-loop adversarial training framework that converts adversarial scenarios into resolvable and capability-aligned curricula. First, we reformulate adversarial scenario generation as a preference alignment problem and employ direct preference optimization to guide the generator toward critical yet resolvable scenarios. Second, we introduce behavioral fingerprints to capture the intrinsic characteristics of the evolving policy and construct a multi-modal capability prediction model that estimates policy performance without expensive closed-loop simulations. By combining resolvability-aligned scenarios with capability predictions, AlignADV develops a dynamic curriculum sampling mechanism that prioritizes scenarios targeting the current policy's vulnerabilities. Experiments on the Waymo Open Motion Dataset demonstrate that AlignADV improves convergence efficiency and final performance, reducing training steps by up to 40.6 percent compared with baseline methods while lowering collision rate and improving route completion under both normal and adversarial traffic conditions. These results highlight a shift from attack-oriented scenario generation to learnability-guided policy improvement, offering a principled direction for safer and more efficient autonomous driving training. Project page: https://meiyuewen.github.io/AlignADV/.
RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation
Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency and thereby prevent their deployment in the unforgiving reality of the road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8X in vision-only mode and 7.9X in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.
SplatlessDF: Continuous Distance Field Mapping with Non-Splatting Gaussians
Recent Gaussian splatting (GS) methods have shown that scenes can be represented efficiently with optimisable Gaussians for high-quality reconstruction and rendering. In this paper, building on this principle, we introduce SplatlessDF, a continuous distance field (DF) mapping framework that uses anisotropic Gaussian elements from a spatial rather than photometric perspective. SplatlessDF directly parameterises the Gaussians and optimises to recover a differentiable DF, enabling distances and gradients to be queried in the spatial domain for downstream robotic tasks such as navigation. Furthermore, SplatlessDF can be coupled with 2D Gaussian splatting (2DGS), providing a unified framework based solely on Gaussian primitives that can learn continuous DF and surface models and supports photometric rendering. We consider two settings: a standalone DF-only formulation and a joint DF-rendering formulation coupled with 2DGS. Experiments show that the standalone formulation provides efficient and accurate distance and gradient queries, while the joint formulation improves rendering geometry and simultaneously models a continuous DF. These results highlight the potential of GS-style representations not only for surface modelling and rendering but also for mapping representations suited to robotic navigation.
Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis
Physically Assistive Robots require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause substantial physical and cognitive fatigue for users with severe motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework. This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated "LLM-as-a-Judge" verifies the code's structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, occupational therapists confirmed the generated policies are safe and accurately reflect user preferences.
comment: Accepted to IEEE RO-MAN 2026
Planning with the Views via Scene Self-Exploration
Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1)understanding how a single action transforms the view, and (2)composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and Data are at https://viewsuite.github.io.
Micro-Swarm Locomotion Optimization in Dynamic Flow using Multi-Objective Multi-Agent Reinforcement Learning
Coordinating micro-robotic swarms in realistic, time-dependent fluid environments remains a major challenge for biomedical and environmental applications. We present a hybrid CFD-MO-MARL (Computational Fluid Dynamics-Multi Objective-Multi Agent Reinforcement Learning) framework that couples a high-fidelity incompressible Navier--Stokes solver with decentralized proximal policy optimization to learn swarm control policies in oscillatory flow. Sixteen magnetically actuated micro-robots were simulated to navigate a pulsatile arterial waveform within a 2 mm channel while jointly optimizing upstream progression, energy efficiency, and motion smoothness. Conflicting objectives are resolved using Projected Conflicting Gradient (PCGrad) surgery. Without PCGrad, energy and smoothness rewards collapse during training, demonstrating that gradient conflict resolution is essential for stable multi-objective learning. The converged policy achieves progress rewards of 6.5-7.0, energy efficiency of 0.63-0.65, and smoothness of 0.97-0.99, outperforming brute-force baselines by more than 8 reward units on the primary objective. Training reveals three emergent behaviors not encoded in the reward function: hydrodynamic throttling formations that reduce peak flow velocities, a cycle-synchronized ratchet mechanism that exploits flow reversals for upstream movement, and individualized final-approach strategies near the target boundary. These results demonstrate that physically realistic fluid--agent interactions can be integrated directly into multi-objective reinforcement learning, providing a scalable framework for micro-swarm control in biomedical navigation, environmental monitoring, and microfluidic systems.
Improving Robotic Generalist Policies via Flow Reversal Steering
Generalist policies can learn a wide range of skills from diverse robot datasets. In order to solve or improve on challenging new tasks, we need a way to infer and invoke the appropriate actions from the policy's rich behavioral prior, especially when directly commanding the policy fails. We focus on flow matching generalists and propose Flow Reversal Steering (FRS): a method that takes suboptimal but ``reasonable'' actions, finds their latent noises by passing them through the flow policy in reverse, and maps them to nearby generalist action modes. We evaluate FRS across many simulated and real-world manipulation settings. First, FRS can turn coarse semantic guidance from humans or vision-language models (VLMs) into corresponding good robot actions, improving zero-shot control. These gains can be distilled with behavioral cloning by training an auxiliary policy to output noises that the generalist maps to good actions -- showing up to 95% absolute task success rate boosts in under a minute of training. Finally, FRS enables policy improvement by bootstrapping reinforcement learning with semantic knowledge, improving on several tasks that standard RL fails to improve on.
ParkourFormer: Integrating Predictive Supervision and Sequence Modeling into Parkour Locomotion
Humanoid parkour requires locomotion policies to coordinate whole-body dynamics across rapidly changing terrains such as stairs, gaps, slopes, and obstacles. Existing reinforcement learning policies are largely reactive, mapping observations directly to actions without explicitly modeling future body states. Such modeling becomes critical in agile locomotion tasks where successful motion execution depends strongly on anticipating upcoming contact transitions and body dynamics. We present ParkourFormer, a Transformer-based sequence modeling framework that reformulates humanoid locomotion as a future-conditioned decision-making problem. The current robot state queries historical sensorimotor trajectories through cross-attention, while a lightweight prediction head forecasts short-horizon future proprioceptive states. The predicted future states, trained with supervised signals, are fused with temporal features to generate actions, enabling the policy to jointly reason over motion history and anticipated future dynamics. We evaluate ParkourFormer on a diverse multi-terrain humanoid parkour benchmark including stairs, gaps, slopes, rough terrain, and obstacle traversal. Experiments in simulation and on a real humanoid robot show that ParkourFormer achieves a 93.85% average traversal success rate on highly challenging terrains, with improvements of up to 47.12% over strong MLP, MoE-based MLP, and vanilla Transformer baselines, while maintaining a single unified policy across all terrain types. These results demonstrate that explicit future-state modeling significantly improves robustness and generalization for agile whole-body locomotion.
comment: Project Homepage: https://mronaldo-gif.github.io/parkourformer.github.io/
Estimation of Ground Reaction Forces from Kinematic Data during Locomotion
Ground reaction forces (GRFs) provide fundamental insight into human gait mechanics and are widely used to assess joint loading, limb symmetry, balance control, and motor function. Despite their clinical relevance, the use of GRF remains underutilised in clinical workflows due to the practical limitations of force plate systems. In this work, we present a force-plate-free approach for estimating GRFs using only marker-based motion capture data. This kinematics only method to estimate and decompose GRF makes it well suited for widespread clinical depolyment. By using kinematics from sixteen body segments, we estimate the centre of mass (CoM) and compute GRFs, which are subsequently decomposed into individual components through a minimization-based approach. Through this framework, we can identify gait stance phases and provide access to clinically meaningful kinetic measures without a dedicated force plate system. Experimental results demonstrate the viability of CoM and GRF estimation based solely on kinematic data, supporting force-plate-free gait analysis.
EqCollide: Equivariant and Collision-Aware Deformable Objects Neural Simulator
Simulating collisions of deformable objects is a fundamental yet challenging task due to the complexity of modeling solid mechanics and multi-body interactions. Existing data-driven methods often suffer from lack of equivariance to physical symmetries, inadequate handling of collisions, and limited scalability. Here we introduce \name, the first end-to-end equivariant neural fields simulator for deformable objects and their collisions. We propose an equivariant encoder to map object geometry and velocity into latent control points. A subsequent equivariant Graph Neural Network-based Neural Ordinary Differential Equation models the interactions among control points via collision-aware message passing. To reconstruct velocity fields, we query a neural field conditioned on control point features, enabling continuous and resolution-independent motion predictions. Experimental results on 2D and 3D scenarios show that \name achieves accurate, stable, and scalable simulations across diverse object configurations. It achieves $24.34\%$ to $57.62\%$ lower rollout MSE, even compared with the best-performing baseline model. Furthermore, \name could generalize to more colliding objects and extended temporal horizons, and stay robust to input transformed with group action. Code is available at: https://github.com/AI4Science-WestlakeU/EqCollide
Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals ICLR 2026
Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent's post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy that maintains training at the frontier of the agent's capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning processes. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula. Code is available at: https://github.com/Octavio-Pappalardo/ulee-jax
comment: ICLR 2026; v2 adds link to code: https://github.com/Octavio-Pappalardo/ulee-jax
Asymmetric Friction in Geometric Locomotion
Geometric mechanics models of locomotion have provided insight into how robots and animals use environmental interactions to convert internal shape changes into displacement through the world, encoding this relationship in a ``motility map''. A key class of such motility maps arises from (possibly anisotropic) linear drag acting on the system's individual body parts, formally described via Riemannian metrics on the motions of the system's individual body parts. The motility map can then be generated by invoking a sub-Riemannian constraint on the aggregate system motion under which the position velocity induced by a given shape velocity is that which minimizes the power dissipated via friction. The locomotion of such systems is ``geometric'' in the sense that the final position reached by the system depends only on the sequence of shapes that the system passes through, but not on the rate with which the shape changes are made. In this paper, we consider a far more general class of systems in which the drag may be not only anisotropic (with different coefficients for forward/backward and left/right motions), but also asymmetric (with different coefficients for forward and backward motions). Formally, including asymmetry in the friction replaces the Riemannian metrics on the body parts with Finsler metrics. We demonstrate that the sub-Riemannian approach to constructing the system motility map extends naturally to a sub-Finslerian approach and identify system properties analogous to the constraint curvature of sub-Riemannian systems that allow for the characterization of the system motion capabilities.
comment: 23 pages, 15 figures
Lifted Schrödinger Bridges for Gaussian Mixture Endpoints: Projection Gaps and Path-Space Obstructions
We study stochastic density control between Gaussian-mixture endpoint distributions under Brownian prior dynamics. Since the direct Schrödinger bridge between Gaussian mixtures is generally not available in closed form, we introduce a lifted path-space construction in which each trajectory is augmented with a source--target component label. Consequently, the problem decomposes into Gaussian component-to-component Schrödinger bridges with explicit marginal, drift, and cost formulas, while the mixture-level assignment reduces to a finite-dimensional entropic coupling problem with a Sinkhorn scaling form. We then analyze the projection obtained by discarding or forgetting the label. By construction, the projected law satisfies the original Gaussian-mixture endpoint constraints, but its relative entropy generally differs from the lifted relative entropy by a nonnegative conditional label-information gap. This gap reveals a path-space obstruction: the lifted optimizer cannot, in general, be identified with the direct unlabeled Schrödinger bridge after projection. We also derive the posterior-averaged Markov drift associated with the projected marginal flow, prove a kinetic-energy upper bound, and identify a common path-potential condition under which the projection gap vanishes. Several numerical illustrations showing density and shape control are recorded for a self-contained exposition.
comment: 35 pages. Submitted to a journal; comments are welcome
Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems
The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital twin driven robotic sorting system that integrates grasp prediction, multi modal perception, and semantic reasoning for real world textile classification. A dual arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state of the art Visual Language Models (VLMs). We benchmark nine VLM s from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9 %), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed accuracy trade offs for edge deployment. A digital twin combined with MoveIt enables collision aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.
comment: 10 pages,single column, 5 figures, preprint for Photomet Edumet 2026 (Klagenfurt, Austria)
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation
Humans primarily rely on walking and running to traverse complex terrains. Similarly, humanoid robots should be able to smoothly transition between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference between tasks and the distribution shift caused by terrain variations. Although Mixture-of-Experts (MoE) architectures can mitigate multi-skill interference, direct joint training often fails to achieve clear expert specialization. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced, and the gating network is trained with a contrastive objective to learn structured terrain representations and promote expert specialization. The final action is obtained through weighted fusion of the base gait policy and the terrain-aware branch, enabling the policy to preserve stable locomotion while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains while maintaining accurate foothold control and dynamic stability.
comment: Kailun Huang, Zikang Xie, Yanzhe Xie and Panpan Liao contributed equally to this work. Corresponding authors: Renjing Xu, Haohui Huang and Chenguang Yang
Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation
Zero-shot object navigation (ZSON) requires robots to find target objects in unseen environments without task-specific fine-tuning or pre-built maps, a key capability for general-purpose service robots. Yet methods that perform well in simulation often degrade in cluttered real-world scenes with severe occlusion and latent hazards, where large unseen regions make single-scene inference brittle and unsafe. We propose Schrödinger's Navigator, a belief-aware framework that reasons at inference time over multiple trajectory-conditioned imagined 3D futures. Given candidate paths, a trajectory-conditioned 3D world model predicts hypothetical observations and maintains a superposition of plausible scene realizations rather than committing to one map. An adaptive occluder-aware sampler directs imagination to uncertainty-critical regions, while a Future-Aware Value Map (FAVM) aggregates imagined futures for robust, proactive action selection. Experiments in simulation and on a physical Go2 quadruped show that Schrödinger's Navigator outperforms strong ZSON baselines, improving hidden-target discovery and risk-aware waypoint selection in occlusion-heavy navigation scenarios. These results highlight imagined 3D futures as a scalable and generalizable strategy for zero-shot navigation in uncertain real-world environments.
ADAPT: An Autonomous Forklift for Construction Site Operation
Efficient material logistics play a critical role in controlling costs and schedules in the construction industry. However, manual material handling remains prone to inefficiencies, delays, and safety risks. Autonomous forklifts offer a promising solution to streamline on-site logistics, reducing reliance on human operators and mitigating labor shortages. This paper presents the development and evaluation of ADAPT (Autonomous Dynamic All-terrain Pallet Transporter), a fully autonomous off-road forklift designed for construction environments. Unlike structured warehouse settings, construction sites pose significant challenges, including dynamic obstacles, unstructured terrain, and varying weather conditions. To address these challenges, our system integrates AI-driven perception techniques with traditional approaches for decision making, planning, and control, enabling reliable operation in complex environments. We validate the system through extensive real-world testing, comparing its continuous performance against an experienced human operator across various weather conditions. Our findings demonstrate that autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics.
X-Loco: Towards Generalist Humanoid Locomotion Control via Synergetic Policy Distillation
While recent advances have demonstrated strong performance in individual humanoid skills such as upright locomotion, fall recovery and whole-body coordination, learning a single policy that masters all these skills remains challenging due to the diverse dynamics and conflicting control objectives involved. To address this, we introduce X-Loco, a framework for training a vision-based generalist humanoid locomotion policy. X-Loco trains multiple oracle specialist policies and adopts a synergetic policy distillation with a case-adaptive specialist selection mechanism, which dynamically leverages multiple specialist policies to guide a vision-based student policy. This design enables the student to acquire a broad spectrum of locomotion skills, ranging from fall recovery to terrain traversal and whole-body coordination skills. To the best of our knowledge, X-Loco is the first framework to demonstrate vision-based humanoid locomotion that jointly integrates upright locomotion, whole-body coordination and fall recovery, while operating solely under velocity commands without relying on reference motions. Experimental results show that X-Loco achieves superior performance, demonstrated by tasks such as fall recovery and terrain traversal. Ablation studies further highlight that our framework effectively leverages specialist expertise and enhances learning efficiency.
comment: Accepted by RSS 2026. Project page: https://x-loco-humanoid.github.io/
EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows
Most learned dexterous grasp generators relegate contact forces to a downstream verification step, so a kinematically-plausible pose can still violate the conditions for a stable physical grasp. We address this with EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. Our architecture projects contacts onto the object surface and forces into the Coulomb friction cone by construction, so placement and friction compliance hold without loss penalties. We prove end-to-end SE(3) equivariance and verify it empirically over 200 rotations, with wrist residuals below $0.04^\circ$ and exactly zero joint deviation. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, our model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. We retarget decoded fingertip contacts to a 16-DoF LEAP Hand via per-finger inverse kinematics, and our hardware-feasible refinement places every joint at least 5% inside its actuator envelope while preserving wrench balance. On the physical robot, retargeted EquiDexFlow-decoded grasps complete open-loop pick-and-hold trials on all six test objects, with every asymmetric object succeeding at both the canonical pose and a $120^\circ$ co-rotation. Videos, code, and checkpoints are available at https://equidexflow.github.io.
comment: 22 pages, 11 figures, 11 tables. Project page with videos, code, and checkpoints: https://equidexflow.github.io
Design and Experimental Validation of Sensorless 4-Channel Bilateral Teleoperation for Low-Cost Manipulators
Teleoperation of low-cost manipulators is attracting increasing attention as a practical means of collecting demonstration data for imitation learning. However, most existing low-cost systems rely on unilateral position control without force feedback, while implementing force-feedback bilateral teleoperation is difficult because low-cost manipulators typically have low-resolution encoders and no joint torque sensors. This paper presents a sensorless 4-channel bilateral teleoperation framework that integrates identified nonlinear dynamics compensation with a disturbance-observer-based velocity and external-force estimation scheme. By interpreting the observer structure in the frequency domain, we clarify the coupling between the velocity- and external-force-estimation bandwidths and derive practical tuning guidelines based on the damping ratio and a single cutoff frequency. Real-robot experiments, including force-sensor comparison and teleoperation tasks, demonstrate that the proposed framework provides practically useful force estimates and enables stable teleoperation in high-speed and contact-rich scenarios under low-cost hardware constraints. As an application, imitation-learning experiments demonstrate that incorporating estimated force information into demonstrations improves task success rates in the tested contact-rich manipulation tasks.
comment: 22 pages, 12 figures, Submitted to IEEE Access
A Unified Control Architecture for Macro-Micro Manipulation using a Active Remote Center of Compliance for Manufacturing Applications
Macro-micro manipulators combine a macro manipulator with a large workspace, such as an industrial robot, with a lightweight, high-bandwidth micro manipulator. This enables highly dynamic interaction control while preserving the wide workspace of the robot. Traditionally, position control is assigned to the macro manipulator, while the micro manipulator handles the interaction with the environment, limiting the achievable interaction control bandwidth. To solve this, we propose a novel control architecture that incorporates the macro manipulator into the active interaction control. This leads to a increase in control bandwidth by a factor of 2.1 compared to the state of the art architecture, based on the leader-follower approach and factor 12.5 compared to traditional robot-based force control. Further we propose surrogate models for a more efficient controller design and easy adaptation to hardware changes. We validate our approach by comparing it against the other control schemes in different experiments, like collision with an object, following a force trajectory and industrial assembly tasks.
comment: 17 pages, 14 figures, submitted to Robotics and Computer-Integrated Manufacturing (RCIM)
Cross-Stage Sensorimotor Perception Scheduling and Sparse Map Encoding for Efficient Edge Embodied Navigation
Embodied agents must close a perception-to-action loop on embedded hardware under tight latency, memory, and energy budgets, making deployment a system-level co-design problem rather than a model-accuracy problem. We study this challenge for modular Object Goal Navigation (ObjectNav), where our profiling shows semantic mapping dominates per-step latency while goal prediction dominates peak memory. We formulate edge embodied navigation deployment as a budget-constrained design-space problem and introduce two orthogonal optimization knobs: SKIP, an adaptive sensorimotor scheduler that formalizes safe skipping as a bounded map-impact criterion and learns a lightweight predictor to estimate it from cheap sensor cues at each \texttt{FORWARD} step, exposing a principled quality-efficiency knob (depth-based updates are always retained); and SCOUT, a sparse-context encoder that couples submanifold sparse convolutions on active map regions with a lightweight dense context stream. On HM3D across server and embedded platforms, SKIP+SCOUT delivers up to 1.7x end-to-end speedup, 50.5% lower peak memory, and 7.1% higher SPL than the dense baseline at the selected operating point, outperforming naively smaller perception backbones. SKIP transfers to a second modular pipeline (PONI) with near-lossless performance and remains robust under depth-sensor noise. Together, SKIP+SCOUT expose a family of device-aware Pareto operating points for edge physical AI systems.
comment: 9 pages, 6 figures
FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation
Force signals provide critical interaction cues for contact-rich robotic manipulation. However, existing methods mostly use force as an additional observation modality, without fully exploiting its role in modeling future interaction dynamics or guiding execution-time feedback correction. In this paper, we propose FAWAM, a force-aware world action model that incorporates force information at three levels: perception, prediction, and closed-loop execution. FAWAM first encodes historical 6-axis force/torque signals to modulate action generation, then jointly predicts future actions and end-effector wrenches to explicitly model contact evolution. It further introduces a residual correction module that uses the predicted wrench trajectory as an execution-time reference to refine actions online based on real-time force feedback. Real-world experiments across multiple contact-rich tasks show that FAWAM improves the average success rate by 36.25% over vision-only baselines and 21.25% over existing force-aware baselines, demonstrating the effectiveness of our force-aware framework for robust contact-rich manipulation.
Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning
For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.
comment: Project website: https://allisonandreyev.github.io/grasp.github.io/
Multiagent Systems
Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning
Cooperative multi-objective multi-agent reinforcement learning (MOMARL) models team decision making under multiple, potentially conflicting objectives. In this setting, conflicts arise not only across objectives but also across agents with different observations, roles, and contributions. We propose Preference Coordinated Multi-agent Policy Optimization (PCMA), which learns coordinated agent-specific preferences to enable complementary trade-offs among agents. Theoretically, we formulate cooperative MOMARL as a team-optimal game and show that, under suitable conditions, preference diversity can induce team improvement through a first-order improvement decomposition. Experiments on multiple cooperative MOMA environments and a practical traffic-control scenario show that PCMA improves both performance and trade-off coordination.
Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning
Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents. Decentralised shields can enforce safety at runtime, but purely factorised permissions often exclude optimal team behaviour that is safe only through coordination. We study deterministic safety guarantees for agents trained and deployed under decentralised execution, recovering team-optimal safe behaviour without centralised runtime control. Agents have a shared global specification $φ$ in the safety fragment of Linear Temporal Logic ($\mathsf{LTL}_{\mathsf{safe}}$ ), and select among tuples of local $\mathsf{LTL}_{\mathsf{safe}}$ obligations whose conjunction implies the global specification $φ$. Each agent may rely on the other agents' local obligations as assumptions because the whole contract tuple is certified simultaneously and allows projection into local action masks. At learning time, a non-stationary multi-armed bandit chooses among a library of local $\mathsf{LTL}_{\mathsf{safe}}$ obligations to select the tuple that optimises team reward, all without forgoing end-to-end safety. We evaluate the approach across 6 environments and 15 algorithmic variants.
Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents ICML 2026
Graphical User Interface (GUI) agents are increasingly used to automate complex computer tasks across applications, websites, and operating systems. To improve their reliability, recent work has introduced experiential memory, where agents retrieve prior trajectories to guide decision-making in similar states. More recent approaches further extend this idea to visual memory by storing and retrieving screenshots from past interactions, providing agents with richer contextual information than text-only memories. However, the effect of visual memory in GUI agents remains insufficiently understood: it is unclear which failures visual memory mitigates, or which failures it exacerbates. To systematically analyze the effect of visual memory, we introduce a taxonomy of four GUI agent failures (i.e., cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error) that map to distinct stages of the perception-reasoning-action pipeline. We find that prepending full-image memory has a divergent effect on the failure distribution: it reduces state-level failures but worsens action-level ones, and increases hidden operation blindness and grounding error. Motivated by this finding, we propose Action-Grounded Visual Memory (AGMem), an action-grounded memory framework for GUI agents. The core idea of AGMem is to store image crops that capture the local GUI region closely related to a successful action or a recovery, rather than storing full screenshots. Experiments on OSWorld show that AGMem improves task success rates by 33.3 % over full-image memory. These results demonstrate that AGMem is an effective representation for visual memory in GUI agents.
comment: 9 pages, 5 figures, ICML 2026 WORKSHOP
Resilient Consensus in Agentic AI
Large language model (LLM) agents are increasingly deployed in multi-agent systems where they must coordinate and agree on shared decisions. We ask whether classical resilient consensus theory, developed for deterministic agents, transfers to LLM agents that may behave adversarially. Framing LLM agreement as a Byzantine consensus game, we run controlled experiments on complete and general communication graphs. We find that prompted LLM agents fail to reach agreement that is achievable in principle: consensus can fail even in settings where classical theory guarantees that a convergent algorithm exists, and this failure persists across temperatures and horizons. At the same time, wrapping the agents with classical resilient consensus filters improves agreement. The benefit of filtering depends on how much robustness the underlying topology already provides. Our results suggest that classical resilient consensus theory is a useful lens for the safety of agentic AI.
Hierarchical Generative Agents for Simulating Sequential Human Behavior
Complex cognitive, emotional, and social processes shape human evacuations during natural disasters. Accurate modeling and understanding of human behavior in disasters or emergencies can greatly impact the evacuation process by informing more effective planning and resource allocation. However, collecting human data in these situations is very difficult, and existing computational evacuation models assume rational, homogeneous behavior, leading to unrealistic, overly optimistic predictions. To address this gap, we present a simulation framework of sequential human decision-making during an evacuation scenario, introducing cognitively grounded, persona-driven agents. Our framework models evacuation behavior in a grid-based urban environment that evolves over time, capturing fire and other hazards. Human agents are modeled as personas that make sequential decisions in response to environmental stimuli with cognition structured in three levels: high-level evacuation goals, mid-level route reasoning, and low-level navigation. Decision-making is driven by large language models (LLMs) coupled with a cognitive module and calibrated with empirical human evacuation data. We propose a dynamic, stimulus-driven disaster simulation framework that models human evacuation decision-making using persona-conditioned LLM agents and a cognitive hierarchy.
comment: 20 pages, 6 figures
Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems
As language-model agents increasingly work in teams, each agent must decide how much to trust its teammates. Yet we lack a standard way to measure trust between AI agents. We propose a behavioral measure based on costly verification. In a cooperative survival game, checking a teammate's work consumes resources, while trusting a wrong answer can be fatal. Relative to a memoryless version of the same model, reduced verification provides an observable measure of trust. Using this framework, we study trust formation, breakage, and recovery across six frontier model snapshots. When paired with a consistently reliable teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduce verification by roughly 60-85%, whereas two smaller snapshots show little or no such adjustment. Failures reverse this discount, but models differ in how they respond. Some concentrate renewed scrutiny on the culprit, while others become more cautious toward the entire team. Recovery is slower than formation, and clustered failures sustain suspicion far longer than the same number of failures spread apart. These differences have practical consequences. Models that form trust verify less, decide more quickly, and achieve higher payoffs in our environment. By contrast, persistent over-verification is associated with indecision rather than safety. Our results show that trust dispositions can be measured before deployment and suggest that calibration, rather than maximal suspicion, should be the central concern in the governance of multi-agent AI systems.
Selective Control under Noisy Perception: Governance Failures Hidden by Aggregate Metrics in Modular Networks
A content-moderation system can score well on every standard accuracy metric and still cause real harm, if its mistakes fall on the few users who connect otherwise separate communities. We show this in an agent-based model where N=240 learning agents on a community-structured network each post harmless, productive, or dangerous content, and a regulator removes or penalizes whatever a noisy classifier flags. Overall usefulness barely moves as the noise changes (one-way ANOVA, p=0.96): by aggregate measures, nothing looks wrong. The damage instead concentrates on these bridge users, whose useful posts are wrongly suppressed and whose dangerous posts are wrongly spared. A governance loss (L_gov) that prices these two mistakes separately from the cost of enforcement more than doubles under false-positive-heavy noise. Aggregate accuracy hides who is harmed, and the cheap quantity to audit is how many connections a user has (degree), a near-perfect proxy for the betweenness that defines a bridge (r=0.96).
comment: 39 pages, 7 figures. Code and data: https://github.com/YehudaItkin/noisy-perception-governance
Physics of anticipatory active matter, with application to crowd dynamics
Statistical Physics has traditionally dealt with entities that interact merely based on the present, and possibly past, configurations. This reactive framework is inefficient in many situations involving living beings, such as predators chasing a prey, pedestrians, or even robots. This paper introduces a statistical physical framework for the dynamics of anticipatory agents, whose present-time dynamics depend on the prospective system state that they anticipate. We clarify how these dynamics can be expressed in terms of a cost function constructed based on observations and we show that the dynamics of an anticipatory agent in d dimensions can be mapped onto the dynamics of a (non-anticipatory) chain in d + 1 dimensions, with fluctuations acting transversely on the chain to account for the uncertainty about the future state. Insights from polymer Physics help us characterize the dynamics of these chains and delineate an anticipation horizon beyond which the blurry future can be handled in a mean-field way. The foregoing framework is successfully applied to pedestrian dynamics, leading to a seamless integration of operational and tactical levels in an agent-based model. Even with a minimal expression of the cost, the model succeeds in reproducing various experimental scenarios which are challenging for state-of-the-art models, such as crossing cluttered environments or alighting from a crowded train. The transparent and flexible basis of the model allows the straightforward incorporation of additional mechanisms.
Obligation-Producing Actions
This paper proposes a Situation Calculus solution to the frame problem for obligation-producing actions, which are actions that create obligations on the part of the agent that performs them. As an example of such actions, we have an opening door action performed by an agent, which has the subsequent obligation of getting the door closed. Demolombe and others extend Raymond Reiter's solution to the frame problem for ordinary actions to accommodate obligation-producing actions. Obligation-producing actions do affect the truth value of a newly introduced fluent that captures the accessibility relation used in semantics of obligation modalities in the Situation Calculus. Our work simplifies Demolombe's characterization of the accessibility relation by eliminating the notion of ideality of situations, thereby remaining close to Kripke-style possible-world semantics for deontic logic, in the spirit of Governatori's approach. Furthermore, we spell out details of a complete solution by extending basic action theories of Reiter to the new setting. Finally, we extend Reiter's regression operator for reasoning about actions back to the initial situation to this new setting. Our solution yields intuitive properties that one would expect from obligations: for example, if a sentence is obligatory to an agent in a given situation, it remains so in subsequent situations unless the obligation is explicitly stopped.
Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs
Large Language Models (LLMs) in multi-turn interactions maintain evolving context rather than generating isolated responses, making them vulnerable to prompt-injection and context-poisoning attacks in which locally plausible adversarial fragments gradually distort reasoning trajectories. Existing defenses mainly filter individual outputs and often ignore context evolution across turns, leaving long-horizon reasoning exposed. Although the Model Context Protocol (MCP) standardizes context exchange and tool invocation, it functions as a passive routing layer and does not enforce contextual stability. To address these limitations, we introduce the Game-Theoretic Secure Model Context Protocol (GT-MCP), a controller-driven multi-agent method that treats context management as a closed-loop dynamical process. GT-MCP coordinates three heterogeneous LLM agents and selects outputs through a trust function that jointly evaluates causal consistency against a validated context graph, semantic agreement among agents, and distributional drift over time. When instability is detected, a rollback-based self-healing mechanism restores the validated context and prevents unsupported fragments from propagating. Empirical evaluation over 500 interaction turns under an adaptive adversarial threat model shows that contextual drift remains bounded in 99.6% of turns, with recovery required in only 0.4%. Per-turn utility remains tightly concentrated, with median = -0.19, P05 = -0.72, and P95 = 0.30; severe degradation below -1 occurs in only 0.4% of cases, and no injection attempt succeeds at the controller level. Selected outputs maintain stable win rates above 98%, and computational overhead remains predictable, with latency per token = 1.63e-3 s.
MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems
LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open-source community.
comment: 18 pages, 11 figures
It's About Time: Temporal References in Emergent Communication
Emergent communication enables agents to develop bespoke languages that improve communication efficiency. Despite the known importance of temporal structure in natural language, there is no existing evidence of temporal references in emergent communication. This paper addresses this gap, by exploring how agents communicate about temporal relationships. We analyse three potential factors for the emergence of temporal references: environmental, external, and architectural. Our experiments demonstrate that altering the loss function is insufficient for temporal references to emerge; rather, architectural changes are necessary. A minimal change in agent architecture, using a different batching method, allows the emergence of temporal references. This modified design is compared with the standard architecture in a temporal referential games environment, which emphasises temporal relationships. The analysis shows that over 95% of the agents with the modified batching method develop temporal references, without changes to their loss function. We consider temporal referencing necessary for future improvements to the agents' communication efficiency, enabling future agents to use a closer to optimal coding as compared to purely compositional languages. These insights provide the basis for incorporation of temporal references into other emergent communication settings, and investigation of other aspects of language.
comment: 23 pages main body and 31 pages supplementary material, 9 figures in main body. Code available at https://github.com/olipinski/TRG
Systems and Control (EESS)
Optimal Hidden-Target Learning for Online Inventory Optimization on General Convex Sets
Online inventory optimization (OIO) is online convex optimization with physical memory: inventory carryover makes the feasible action set depend on the past. A natural principle, used in stochastic inventory learning and recently in OIO under a single linear capacity constraint, is to maintain a hidden target chosen by an online learner and implement its projection onto the currently feasible order-up-to set. We prove that this simple principle is optimal for OIO on arbitrary bounded convex capacity sets. With online gradient descent as the base learner, the method improves the best known regret guarantee for OIO on general convex sets from inverse to inverse-square-root dependence on the common-demand probability, and we prove a matching lower bound. The same principle gives the first polylogarithmic regret guarantee for strongly convex losses and the first dynamic regret guarantee adapting to Euclidean path variation on general convex capacity sets. The analysis introduces a norm alignment principle: the right state variable is the distance from the hidden target to the feasible set, measured in the same norm as the projection. Under norm alignment, this distance evolves pathwise as a scalar queue, with target movement as arrival and common demand as service. This reduction to one-dimensional queue control resolves the state dependence and extends the guarantees to general convex capacity sets, beyond the reach of prior productwise approaches. Experiments on synthetic and real-world inventory data corroborate the theory.
Towards unified Geophysical Data Requirements for Magnetic Navigation (MagNav)
Magnetic Navigation (MagNav) has emerged as a vital alternative Positioning, Navigation, and Timing (PNT) solution, leveraging Earth's magnetic field for robust navigation in GPS/GNSS-degraded or denied environments. Despite its potential, the successful deployment of MagNav is currently hindered by the lack of standardized, high-fidelity geomagnetic reference maps. Existing datasets, primarily designed for geological exploration or academic research, do not meet the distinct operational requirements of navigation systems regarding spatial resolution, error quantification, and global accessibility. This paper initiates a community-focused dialogue on future geophysical data requirements for MagNav, grounded in extensive real-world flight trials. We distinguish between two primary use cases with divergent data needs: Operational MagNav, which requires globally consistent, queryable, and uncertainty-aware datasets for field deployment, and MagNav R&D, which demands comprehensive access to raw survey data to foster innovation. We provide a prioritized set of recommendations for future data requirements, including the development of cohesive, merged datasets, the inclusion of localized 3D uncertainty estimates, and the expansion of the World Magnetic Model (WMM) core field model to spherical harmonic degree 13 to improve consistency. Finally, we emphasize the strategic necessity of designated test ranges to validate these requirements and ensure the operational robustness of MagNav infrastructure.
comment: 21 pages, 1 figure
Whole-Body Impedance Model Predictive Control for Safe Physical Human--Robot Interaction on Floating-Base Platforms
Floating-base robots must balance under rigid contact constraints while interacting safely with humans. Existing whole-body control~(WBC) frameworks allocate the full joint space to locomotion or rely on fixed-gain impedance feedback that accumulates steady-state error under sustained physical human--robot interaction~(pHRI) forces. This paper extends the authors' fixed-base two-layer Impedance MPC to floating-base platforms through a three-level architecture: a centroidal MPC plans contact forces over a 500\,ms horizon; a priority-driven WBC layer resolves balance into joint torques through contact-consistent null-space projection; and the residual null space is governed by a receding-horizon quadratic program~(QP) that predicts and rejects pHRI disturbances using a Kalman-augmented state. A contact-consistent feedback linearization reduces the arm end-effector plant to a double integrator with a \emph{constant} state matrix within each contact mode, enabling offline precomputation of the QP cost and ${\geq}1$\,kHz operation. A covariance-inflation protocol preserves the disturbance estimate across contact-mode switches, guaranteeing zero steady-state error under bounded constant pHRI loads, and an Impedance Equivalence Theorem shows the infinite-horizon limit recovers a classical task-space impedance law whose effective mass, damping, and stiffness adapt to posture and contact configuration. Simulations on a 17-DOF biped and the Unitree G1 humanoid validate the design.
Impedance MPC with Disturbance Estimation for Dexterous Hand Control
Dexterous hands must simultaneously track precise finger trajectories and maintain safe, compliant contact -- objectives in tension for any fixed-gain controller. We present an actuator-agnostic Impedance Model Predictive Control (Impedance MPC) framework for dexterous fingers, instantiating the constant-$A_d$ offset-free architecture established for physical human-robot interaction (pHRI); its stability, recursive-feasibility, and input-to-state-stability guarantees are inherited by preserving the architectural assumptions. An algebraic feedforward reduces the tendon transmission -- hydraulic, cable, pneumatic, twisted-string, or series-elastic -- to a constant-coefficient double integrator, so the QP cost inverse is precomputed offline and a 10-step receding-horizon quadratic program runs at 500\,Hz while enforcing hard constraints on contact force (ISO/TS 15066), actuation limits, and jerk. An encoder-only augmented-Kalman disturbance state drives steady-state error to zero under any constant contact load. On a hydraulically actuated finger -- the worked example platform, adding pressure and cavitation constraints -- the 500\,Hz Kalman MPC attains 0.5\,mrad RMS, 0.1\,mrad steady-state, and 6.6\,mrad peak deflection under 1.5\,Nm contact: 183$\times$, 1500$\times$, and 23$\times$ better than classical impedance. The realized first-move stiffness (18$\to$323\,Nm/rad with update rate) is independently verified. The architecture scales to a 16-DOF LEAP Hand MuJoCo simulation, recovering from 2.5\,N grasp-load disturbances within 0.7\,s.
A Statistical and Machine Learning Framework for Operational Threshold Detection and Deployable Dispatch Controller Development in Hydrogen Multi-Energy Systems
This study presents a statistical and machine learning framework for characterizing a hydrogen-based multi-energy system (H-MES) using one year of high-resolution operational data. Statistical analysis revealed a binary operation driven by renewable surplus, with solar irradiance explaining 45.7% of rank-based variance in hydrogen production, a large effect by conventional standards. Only high-irradiance periods triggered meaningful electrolyzer engagement, while electricity demand exerted a weaker inverse suppression effect ($ε^2 = 0.126$). Multiple regression confirmed electrolyzer power as the dominant linear predictor, with a synergistic solar-wind interaction. Notably, Random Forest analysis ranked wind output first in predictive importance despite its weak bivariate correlation (r = 0.167), revealing non-linear dynamics invisible to parametric methods. A sequence model exploited strong 24-hour autocorrelation (r = 0.845) for operational forecasting, while a reinforcement learning agent optimized hydrogen revenue dispatch. The core contribution is demonstrating that statistical and machine learning approaches are complementary for H-MES modeling and control.
comment: 17 pages, 12 figures
Provably Safe, Yet Scalable Reinforcement Learning
Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provide formal safety guarantees for the learned policy. In contrast, methods with strict guarantees typically rely on explicit certificate functions, whose construction requires the direct synthesis and verification of control-invariant sets, a process that scales poorly with state dimension and often yields overly conservative behavior. In this paper, we present the Provably Safe, yet Scalable RL (PS2-RL) framework, a novel two-phase architecture for learning provably safe policies in a scalable manner, designed to overcome the key bottlenecks of prior methods. Rather than explicitly computing invariant sets, PS2-RL leverages a learned backup policy to forward-integrate the system dynamics, generating an implicit control-invariant set online. In the first phase, the backup policy is trained with our proposed safe-arrival value function, which characterizes the optimal backup policy for invariant-set construction. In the second phase, an RL policy is trained end-to-end through a differentiable projection layer that strictly enforces the safety guarantees induced by the learned backup policy. By maximizing the volume of the implicit control-invariant set in the first phase, the resulting PS2 policy from the second phase is performant and scalable, while maintaining provable safety. Crucially, PS2-RL imposes no restrictions on the underlying RL algorithm and can be plugged into any existing training pipeline. We establish theoretical guarantees for the proposed framework and evaluate it on robotic control tasks with state dimensions up to 10, a regime in which prior provably safe RL methods struggle or become impractical.
Optimization Models and Steady-State Minimum-Fuel Operating Strategies for Hydrogen-based Hybrid Electric Aerospace Propulsion Systems
This paper presents an optimization framework for the operation of hydrogen-based hybrid electric aerospace propulsion systems consisting of a hydrogen gas turbine and an electric motor powered by a solid oxide fuel cell, connected to the gas turbine via multiple gas channels and heat exchangers. Our framework computes the minimum-fuel optimal operating strategies over a flight mission accounting for the complex propulsion system with strong thermodynamic and mechanical coupling between components. First, we identify surrogate optimization models of the components employing high-fidelity model simulations. Second, we frame the minimum-fuel optimal control problem over a given flight mission and parse it into a static nonlinear optimization problem that can be efficiently solved with off-the-shelf nonlinear programming algorithms. Finally, we apply our optimization framework to a typical flight mission of an advanced, commuter aircraft (Beechcraft 1900D market segment), considering a parallel propulsion system architecture with four different configurations that share a common baseline but differ in the inclusion of an additional battery and by-pass valves around the two heat exchangers. The resulting optimal trajectories are validated against high-fidelity simulation results, demonstrating the accuracy of our framework. Results show that adding by-pass valves around the air and hydrogen heat exchangers can reduce fuel consumption by 19.11 % without the battery, and by 19.56% with the battery. We show that adding a battery yields a slight increase in fuel consumption (below 1%) for future projected energy densities under steady-state conditions. Conversely, when considering state-of-the art energy densities, the additional battery weight outweighs the benefits, limiting its potential applicability to only assisting transients, which are not considered in the present work.
A Generalized Plant Perspective on Linear-Convex Feedback Optimization
Feedback optimization is a control approach for driving a dynamical system to the solution of an optimization problem by interconnecting the plant with an algorithm. Existing stability guarantees typically rely on timescale separation, enforced by conservative gain bounds that limit transient performance and require a pre-stabilized plant. This paper revisits the robust control perspective on feedback optimization. We formulate the plant-optimizer interconnection as a generalized plant, where the cost gradients are characterized by Zames--Falb Integral Quadratic Constraints. Classical timescale-separation bounds are recovered as a special case of static multipliers, with dynamic multipliers yielding substantially tighter stability margins. The formulation also enables IQC based synthesis of dynamic output feedback controllers that jointly stabilize the plant and optimize transient performance, with possible model uncertainty absorbed into an uncertainty channel. For constrained problems, the framework extends to dynamic controllers that generalize projected gradient flows. Numerical examples illustrate the benefits and flexibility of the proposed approach.
Orbital Station-Keeping in the Earth-Moon System via Nonlinear Backstepping
A nonlinear orbital station-keeping solution for the circular and elliptic versions of the Earth-Moon Restricted Three-Body Problem (R3BP) is developed via a backstepping technique. Formal guarantees for global asymptotic stability (GAS) are attained, as shown through Lyapunov's stability theory. The adequacy of the proposed control law is evaluated through the means of numerical trials over closed periodic solutions of the circular and elliptic R3BPs. The ramifications of the control gain choice are carefully studied and simulated.
comment: Presented at the 2025 IEEE 19th International Conference on Control & Automation (ICCA). Please cite the published version
A Floquet Mode LQR for Orbital Station-Keeping in Cislunar Space
A linear optimal control law for orbital station-keeping in the Earth-Moon Restricted Three Body Problem (R3BP) is developed via Linear Quadratic Regulator (LQR) theory. First, the cost function is established considering a periodic state-weight matrix, leveraging stability information of the target orbits retrieved through Floquet theory. Then, the resulting periodic Riccati differential equation is solved and local asymptotic stability guarantees are shown. Finally, the performance of the proposed LQR when tracking periodic orbits in the circular and elliptic R3BPs is analyzed numerically.
comment: Accepted for presentation at the 2026 European Control Conference (ECC)
A Feedback Stability Theorem for Frequency-dependent Compensation of Excess and Lack of Passivity
This article studies the stability of feedback interconnections of linear time-invariant systems based on frequency-dependent passivity indices. Using these frequency-dependent passivity indices, we show that the feedback interconnection of two systems can be certified to be stable even if both systems have a lack of passivity in terms of their scalar passivity indices. The main contribution of this paper is a new stability theorem based on frequency-dependent passivity indices. Moreover, we discuss the connection of the proposed feedback stability theorem to prior results based on scalar passivity indices. A numerical case study showcases the advantages of frequency-dependent passivity indices over scalar indices for feedback interconnections of linear systems.
Intelligent Domain Adaptation for Power System Transient Stability Assessment Under Varying Operating Scenarios
While deep learning-based transient stability assessment (TSA) approaches have exhibited great potential in power system stability monitoring, they are prone to undergo performance degradation in practical contexts with frequent variations of operating conditions. To address this issue, this work develops an adaptive TSA framework via domain adaptation-enabled deep transfer learning. First, for the sake of capturing the primary transient stability characteristics, a robust metric, i.e., heterogeneous hybrid distribution metric (HHDM), is designed through mathematical means to effectively handle multi-scale Gaussian and long-tail distributions of transient responsive data and to precisely quantify the intrinsic distributional discrepancies between the source and target domains corresponding to different operating scenarios. With the help of the HHDM, a Bayesian theory-based dual-distribution domain adaptation method is constructed, aligning not only marginal probability distributions between domains but also the distributions of sub-domain categories. Such alignments enable fine-grained transient stability feature transfer, helping significantly improve the adaptability of a well-trained TSA model to target domains. Furthermore, a multilayer sparse regularization algorithm is introduced to mitigate feature volatility caused by variations in operating scenarios, thereby enhancing the model's generalization in the presence of unforeseen scenarios. Numerical tests on three test systems illustrate that, compared with conventional methods, the proposed framework improves online TSA accuracy by 0.5% to 5% in a cost-effective manner, with the learning cost for TSA model update largely reduced.
Topology Optimization for DC Circuit Breaker Placement in HVDC Switching Stations
HVDC protection will be required in future multiterminal HVDC grids to prevent large outages caused by DC faults. Therefore, system-level protection design is essential for the development of HVDC switching stations that connect several converter stations and lines within these grids. This paper presents an optimization method for the design of HVDC circuit breaker (DCCB) configurations in HVDC switching stations and electrical energy hubs. This approach builds on the current practice of using selected configurations based on pre-defined protection strategies. In contrast to these existing methods, the DC switching station design in the proposed method offers significantly more flexibility and allows the consideration of large numbers of relevant operating conditions, leading to more effective, optimal design outcomes. A mixed-integer linear optimization problem is formulated to design the DC protection and minimize the risk of high impact DC faults. An example case study demonstrates that the optimization method allows the calculation of the optimal number of DCCBs for a given DC switching station, based on the failure rates of DC grid components and the DCCB cost relative to the fault impact. With these results, the marginal benefit to risk reduction of each additional DCCB included in a DC switching station is calculated. Moreover, the result of the optimization problem provides the optimal breaker configuration for the required number of DCCBs and can consequently be used as a topological design tool for DC switching stations.
comment: 19 pages, 8 figures, Submitted to IEEE Transactions on Power Delivery
Short-Horizon Position Accuracy of Single-Track Models: Implications for Motion Planning of Autonomous Vehicles
Accurate and computationally efficient vehicle models are essential for motion planning of autonomous vehicles, where positional accuracy directly affects trajectory feasibility and safety. However, the positional accuracy has not been systematically evaluated against real measurements. Therefore, this paper compares the short-horizon positional accuracy of three single-track vehicle models against vehicle measurements across various driving maneuvers. Model parameters are identified through dedicated experiments with the instrumented test vehicle. Rather than identifying a single best model, this work aims to provide insight into the trade-offs between model complexity, parameterization quality, and positional accuracy for informed model selection in Model Predictive Control applications.
comment: Submitted to The International Journal of Automotive Engineering, Official Journal of the Society of Automotive Engineers of Japan, Inc. (JSAE)
StreamRTPS: Increasing DDS Bandwidth Efficiency by Reducing Protocol Overhead
In this paper, we propose three extensions to the Real-Time Publish Subscribe wire protocol, on which Data Distribution Service (DDS) is based, to improve bandwidth efficiency. First, a stream negotiation mechanism exchanges static header information during discovery, replacing the full RTPS header at runtime with a compact 2 B identifier. Second, a payload aggregation scheme aggregates samples for the same locator into single UDP packets, reducing IP and UDP header costs. Third, a predictive heartbeat suppression strategy reduces control traffic by omitting heartbeats for periodic communication patterns, falling back upon detected loss or timing violations. All three mechanisms preserve Real-Time Publish Subscribe(RTPS) compatibility by extending DDS discovery to activate these features when supported. Experimental results show that stream headers reduce bandwidth consumption by up to 27.9 % compared to conventional RTPS under best-effort transport, and that heartbeat suppression yields a further 22.7 % reduction on top of stream headers under reliable transport, while preserving transmission latency in both cases.
comment: 8 pages, 8 figures
SOS-based Stability Verification for Saturated INDI Control of Hybrid-VTOL Aircraft Pitch Rate Dynamics
Incremental nonlinear dynamic inversion (INDI) is a prominent flight-control strategy valued for its robust disturbance rejection; however, its formal stability verification has traditionally been limited to linearized dynamical models. This paper presents a formal nonlinear stability certificate for a saturated INDI pitch-rate controller for a hybrid vertical take-off and landing (VTOL) aircraft by representing the INDI controller via an equivalent recurrent equilibrium network (REN). By casting the saturated INDI architecture as a REN, the closed-loop dynamics are exactly mapped to an augmented state-feedback system. This structural equivalence enables the use of sum of squares (SOS) programming to synthesize a locally valid Lyapunov function without relying on conservative bounding approximations. The resulting certificate yields an inner estimate of the region of attraction (RoA) that explicitly accounts for actuator saturation, formally verifying the controller's stability in operating regimes where standard linear margins lose their validity.
Robustness without Wrinkles: Parallel Simulation and Robust MPC for Certified Deformable Manipulation
We present CORD-SLS, a real-time control method for safe deformable object manipulation, with a focus on ropes and cloth. At its core is a GPU-parallel differentiable simulator with contact smoothing which enables efficient gradient-based planning through intermittent contact. To robustly satisfy constraints under model and sensing uncertainty, we develop a real-time, GPU-parallel output-feedback robust model predictive control (MPC) algorithm that plans with this simulator. We further show that the simulator accelerates model-based RL for training neural manipulation policies. To improve real-world robustness, we use conformal prediction to calibrate visual-feedback and perception-error bounds for MPC, producing reachable tubes that enable high-probability safe control. We evaluate CORD-SLS on high-dimensional, contact-rich rope and cloth manipulation tasks in simulation and hardware, including obstacle avoidance, routing, folding, and smoothing. Across settings, CORD-SLS achieves millisecond-speed planning, exceeding baselines in safety, speed, and task success.
Environment-Aware Stable Neural Koopman Dynamics Learning for Input-Driven Systems under Environmental Constraints
Constructing predictive models of nonlinear dynamical systems from measurement data is a longstanding problem in systems identification and control. Although Neural ordinary differential equations~(Neural ODEs), Koopman operator approximations, and input-aware architectures have each moved the field forward, none simultaneously addresses environment-varying operating conditions, rigorous stability guarantees, and input-to-state stability (ISS) certification within a unified trainable framework. This paper introduces Environment-Aware Stable Neural Koopman Dynamics Learning (ESNKD), which integrates four components: (i)~a bundle-structured encoder that maps environmental observations to a geometrically regularized latent manifold, drawing on the fiber bundle framework; (ii)~an input-conditioned Neural ODE whose residual term handles arbitrary external signals, extending the input concomitant philosophy; (iii)~a contraction synthesis layer enforcing convergence via Persidskii-type tractable linear inequalities, analogous to the certification mechanism; and (iv)~a Koopman lifting stage with LMI-based ISS verification that follows the theoretical pipeline of. Theoretical guarantees cover solution existence and uniqueness, incremental exponential stability, ISS with explicit gain bounds, and robustness to environmental perturbation. Experiments on five benchmark systems, including two robotic manipulation platforms, show consistent improvements over five competitive baselines in both prediction accuracy and safety certification rates.
HPC-Enabled Generator Importance Assessment for RTO-Scale Resource Adequacy Planning
Modern power systems are increasingly under stress as aging assets approach retirement and load growth outpaces new generation construction. The severity of this challenge varies by region: in the EU, the transmission grid can partially compensate for local generation shortfalls, while in the US, generation tends to be more localized, making retirements harder to offset. Retirement of generation has consequences for system reserves, fuel supply chain, and public health. We present an high-performance computing (HPC) framework for rapidly assessing the grid importance of individual generating units and ranking them by primary fuel type, operating cost, or grid impact. Historically, such studies were computationally intensive and therefore conducted infrequently. This work demonstrates that such assessments can be completed in minutes, enabling planners to evaluate a much broader range of generation portfolio scenarios than was previously possible.
Semidefinite Relaxations for Collision-Free Motion Planning
We study semidefinite relaxations for collision-free motion planning. We focus on a point robot moving from start to goal through spherical obstacles in $\mathbb{R}^n$, subject to path continuity constraints and squared derivative costs; a setting that is conceptually simple yet captures the hardness of collision-free motion planning. We formulate this problem exactly as a nonconvex problem over polynomial curves, and present a natural semidefinite relaxation. We contribute two key theoretical insights; to our knowledge this is the first theoretical analysis of semidefinite relaxations for collision-free motion planning. First, we show that solving the convex relaxation is equivalent to solving, to global optimality, a related motion planning problem in a potentially higher-dimensional space. This geometric interpretation yields necessary and sufficient conditions for tightness, and a clear intuition for when the relaxation is loose. Second, we show that the relaxation admits a symmetry reduction that makes it significantly smaller than one might expect, with positive semidefinite cone sizes that scale linearly with the polynomial degree and are independent of the ambient dimension. The resulting relaxation is 10 to 100 times faster than direct nonlinear programming transcriptions solved with SNOPT and IPOPT, exhibits significantly lower variance in solve times, and reliably finds a locally optimal path for the original problem. We demonstrate its effectiveness as a convex steering function in an RRT planner for minimum-snap quadrotor planning with $C^4$ continuous trajectories.
Battery Bidding under Price Uncertainty in Wholesale Electricity Markets
Grid-scale batteries increasingly influence outcomes in wholesale electricity markets, but their observed bid patterns remain difficult to interpret. In particular, bids that appear to reflect strategic withholding may instead arise from rational operations under price uncertainty and risk management. We develop an asset-level model of a price-taking battery that submits stepwise buy and sell bid curves in the day-ahead market under a finite set of price scenarios. The battery chooses quantity--price pairs to maximize a mean--CVaR objective subject to physical and market constraints. A direct formulation is a mixed-integer linear program, but we show that its integer decisions can be removed, yielding an exact linear programming reformulation suitable for empirical analysis. Our empirical results deliver three insights. First, withholding behavior can arise even without market power, because scarce stored energy and uncertain future prices increase the value of holding energy. Second, the effect of uncertainty depends on the state of charge: when stored energy is scarce, greater uncertainty raises sell bid prices, whereas when stored energy is abundant it can lower them. Third, risk management reshapes bid curves into layered structures that secure profitable execution across a broad set of scenarios while preserving some exposure to rare but valuable price spikes.
Same-Origin Policy for Agentic Browsers
Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that prevents unauthorized automated cross-origin data flows induced by scripts. However, whether SOP remains effective in agentic browsers is an open question that has not been systematically studied. In this work, we bridge this gap. We first observe that an agentic browser can itself serve as an automated channel for cross-origin data flows, potentially leading to SOP violations. To investigate this phenomenon, we construct SOPBench, a benchmark for evaluating SOP violations in agentic browsers. Our evaluation shows that existing agentic browsers frequently violate SOP, both in benign settings and under attacks. To address this problem, we propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers. We implement SOPGuard in BrowserOS, an open-source agentic browser. Extensive evaluations demonstrate that SOPGuard effectively enforces SOP while preserving utility and incurring only a small runtime overhead. Our code and data are available at https://github.com/wxl-lxw/BrowserOS-SOPGuard.
Pseudonym Scheme Based on Hybrid Certificates for Security Credential Management System in Vehicular Communications
In recent years, the Institute of Electrical and Electronics Engineers (IEEE) and the European Telecommunications Standards Institute (ETSI) have developed a series of security communication standards for vehicular communications. These standards include mechanisms such as the Security Credential Management System (SCMS) and Butterfly Key Expansion (BKE) to protect vehicle privacy. However, these standards are mainly based on the Elliptic-Curve Cryptography (ECC), which may be vulnerable to attacks from quantum computing in the future. In response to this potential risk, this study proposes a hybrid certificate that combines the ECC with Post-Quantum Cryptography (PQC). This approach enables infrastructure systems to be built on cryptographic foundations that are more resilient to quantum-based attacks. Furthermore, this study presents a generalized pseudonym scheme that is compatible with various cryptographic algorithms for generating pseudonym certificates. This design aims to eliminate the possibility of inferring any correlation between the public key in a pseudonym certificate and that in an enrollment certificate. This study also conducts a comprehensive performance evaluation of the RSA, ECC, and PQC algorithms, particularly those standardized by the National Institute of Standards and Technology (NIST). The comparison considers factors such as message length and computation time. Based on the findings, this study recommends suitable pseudonym schemes that adopt hybrid certificates for secure and efficient use in vehicular communications.
Resilient Consensus in Agentic AI
Large language model (LLM) agents are increasingly deployed in multi-agent systems where they must coordinate and agree on shared decisions. We ask whether classical resilient consensus theory, developed for deterministic agents, transfers to LLM agents that may behave adversarially. Framing LLM agreement as a Byzantine consensus game, we run controlled experiments on complete and general communication graphs. We find that prompted LLM agents fail to reach agreement that is achievable in principle: consensus can fail even in settings where classical theory guarantees that a convergent algorithm exists, and this failure persists across temperatures and horizons. At the same time, wrapping the agents with classical resilient consensus filters improves agreement. The benefit of filtering depends on how much robustness the underlying topology already provides. Our results suggest that classical resilient consensus theory is a useful lens for the safety of agentic AI.
CREST: Deployment-Realistic Hardware-in-the-Loop NAS for Embedded Sensing Systems
Deploying neural networks on low-power microcontrollers (MCUs) requires selecting model architectures under tight memory, latency, and energy constraints. Existing workflows often simplify this process along one or more axes: static proxy costs such as FLOPs or parameters, treating one MCU as representative, and continuous-inference tests instead of deployed sensing schedules. These assumptions can mis-rank Pareto-front candidates, miss infeasible deployments, and obscure schedule-dependent energy. We present CREST (Cross-platform Runtime Evaluation and Search Tool), a deployment-realistic hardware-in-the-loop (HIL) neural architecture search (NAS) framework for MCU sensing systems. CREST keeps the optimizer, HIL measurement boundary, logging, and replay workflow fixed while exposing workload, model family, target backend, schedule, quantization, and scoring policy as configurable axes. This makes deployment effects experimentally separable within one reusable workflow. We evaluate CREST on inertial odometry and audio classification across three Arm Cortex-M targets. For inertial odometry, measured-energy HIL search reduces median per-inference energy by 41.7% versus FLOPs-based selection and 40.8% versus memory-traffic-based selection at similar error. FLOPs-based selection also chooses infeasible deployments on memory-constrained targets. On the STM32 N657 target, continuous-inference and duty-cycled searches produce different Pareto frontiers. For audio classification, the same application-level policy selects different DS-CNN architectures on different boards, and cross-board replay changes deployment cost substantially. Overall, CREST shows that deployment-realistic MCU NAS must jointly optimize model architecture, target platform, runtime schedule, and deployment policy rather than relying only on static proxy costs or continuous-inference measurements.
comment: 14 pages, 10 figures, 7 tables
A state-of-charge based formulation for storage participation in electricity markets: Technical Reference
We consider a storage device and develop a basic market design that represents the salient characteristics of storage such as state-of-charge (SOC) and round-trip efficiency. The contribution of this paper is a market design that reflects these technical characteristics, does not require bids and offers by the storage owner within the market horizon, but does require an end-of-horizon bid/offer for deviating the end-of-horizon SOC from the start-of-horizon SOC\@. Small examples are used to illustrate the market design and large-scale implementation is considered. Several extensions are sketched in Appendices.
comment: 24 pages, 1 figure
Robust Sampling-Based Covariance Steering for Aerocapture Guidance
Aerocapture is a maneuver where a spacecraft dives through the atmosphere of a planet or moon to reduce its velocity and prepare for orbital insertion. Aerocapture allows for higher cruise velocities and reduces fuel consumption, decreasing transit time and increasing payload mass. However, uncertainties in the atmospheric entry state and atmospheric density increase the risk of aerocapture. Dynamic nonlinearities and nonlinearities caused by the state-dependence of the atmospheric density pose additional challenges. This work develops a robust sampling-based covariance steering algorithm designed for aerocapture guidance. Our proposed algorithm leverages sampled nonlinear system trajectories to improve evaluation of the delta-V required for aerocapture and address nonlinearities caused by the aerocapture dynamics and atmospheric disturbances. We perform Monte Carlo simulations with dispersed entry and atmospheric conditions on aerocapture scenarios at Mars and Uranus and demonstrate a 5-15% reduction in the 99th-percentile, 99.7th-percentile, and worst-case delta-V required for aerocapture when compared against a state-of-the-art covariance steering algorithm.
comment: AAS/AIAA Astrodynamics Specialist Conference 2025
Real-time nonlinear model predictive control framework for event-triggered switching in industrial batch polymerization process
Controlling batch polymerization is challenging because the absence of a steady operating point prevents standard linearization; the dynamics are intrinsically nonlinear; and multi-phase operation induces state-triggered switching. This study systematically combines four established real-time NMPC ingredients, smooth mode blending, advanced-step warm starts, variable scaling, and a capped iteration budget, to attain real-time feasibility without ad hoc switching heuristics. We provide practice-oriented guidance for selecting smoothing gains and locating switching surfaces, and we make explicit the approximations introduced by smoothing such that, with appropriate tuning, the smoothed and original switching logic are numerically indistinguishable at solver-tolerance levels. All results are obtained in closed-loop simulation using an industrial gas-liquid polymerization benchmark with estimator-in-the-loop, compared against PID and conventional NMPC baselines. Results show improved constraint satisfaction and shorter batch duration under bounded computation, while an ablation study quantifies the specific contributions of each component individually
Controller and Control Architecture Co-Design via Mixed-Integer System-Level Synthesis
We study controller and control-architecture co-design for dynamic output-feedback systems. The architecture selects active sensors and actuators, sensor-to-actuator links, and link delays, with costs for hardware activation and communication latency. Direct optimization over controller transfer matrices and discrete links is mixed-integer nonconvex; common alternatives fix the architecture, use regularization, or restrict the controller information pattern to a quadratically invariant (QI) class. We instead optimize finite-horizon output-feedback system-level synthesis (OF-SLS) responses. Binary variables select sensors, actuators, links, and delays, and indicator constraints zero unavailable FIR response blocks before the selected delays. For implementation-local OF-SLS architectures, this gives an exact mixed-integer convex program over a prescribed finite delay menu. A global solve certifies the best architecture-response pair for the chosen delay menu, FIR horizon, admissible architecture set, and scalarization weight. The same encoding gives a QI controller-support reference problem. In a vehicle-platoon benchmark, 99 of 8748 architectures are QI-compatible. At equal architecture cost, the selected non-QI OF-SLS architecture reduces performance loss by a factor of 3.8 relative to the best QI architecture and outperforms regularization-based and canonical information-flow baselines.
An Analytical Methodology for Quantifying Airspace Conflict Rate and Complexity
Air traffic growth, advanced air mobility, and increasingly autonomous operations are driving the need for scalable and adaptive airspace design methodologies. Central to this challenge is quantifying how traffic flow structure and demand, governed in part by airspace geometry, influence conflict generation and operational complexity. This paper presents an analytical framework for computing conflict rate and conflict probability in structured airspace using stochastic flow models. Traffic streams are modeled as renewal processes with prescribed inter-arrival time distributions, while interactions between flows are captured through geometry-dependent minimum spacing constraints at merges and crossings. Within this formulation, closed-form upper bounds on the expected conflict rate and conflict probability per aircraft are derived as functions of flow configuration and demand. These metrics are interpreted as complementary measures of airspace complexity, reflecting controller workload and per-aircraft operational risk. The methodology is applied to representative hexagonal cell geometries with varying routing structures and flow distributions. Results reveal non-monotonic tradeoffs between routing flexibility, capacity, and conflict generation, with intermediate flow configurations outperforming both highly constrained and highly distributed cases. The proposed framework provides a tractable tool for evaluating airspace design alternatives and complexity-informed traffic management strategies.
comment: 28 pages, 20 figures, submitted to AIAA Journal of Air Transportation
Micro-Swarm Locomotion Optimization in Dynamic Flow using Multi-Objective Multi-Agent Reinforcement Learning
Coordinating micro-robotic swarms in realistic, time-dependent fluid environments remains a major challenge for biomedical and environmental applications. We present a hybrid CFD-MO-MARL (Computational Fluid Dynamics-Multi Objective-Multi Agent Reinforcement Learning) framework that couples a high-fidelity incompressible Navier--Stokes solver with decentralized proximal policy optimization to learn swarm control policies in oscillatory flow. Sixteen magnetically actuated micro-robots were simulated to navigate a pulsatile arterial waveform within a 2 mm channel while jointly optimizing upstream progression, energy efficiency, and motion smoothness. Conflicting objectives are resolved using Projected Conflicting Gradient (PCGrad) surgery. Without PCGrad, energy and smoothness rewards collapse during training, demonstrating that gradient conflict resolution is essential for stable multi-objective learning. The converged policy achieves progress rewards of 6.5-7.0, energy efficiency of 0.63-0.65, and smoothness of 0.97-0.99, outperforming brute-force baselines by more than 8 reward units on the primary objective. Training reveals three emergent behaviors not encoded in the reward function: hydrodynamic throttling formations that reduce peak flow velocities, a cycle-synchronized ratchet mechanism that exploits flow reversals for upstream movement, and individualized final-approach strategies near the target boundary. These results demonstrate that physically realistic fluid--agent interactions can be integrated directly into multi-objective reinforcement learning, providing a scalable framework for micro-swarm control in biomedical navigation, environmental monitoring, and microfluidic systems.
Frequency-Domain Characterization of Load Demand from Electrified Highways
Electrified roadways (ER) equipped with dynamic wireless power transfer (DWPT) capabilities can patently extend the driving range and reduce the battery size of electric vehicles (EVs). However, due to the spatial arrangement of the transmitter coils in the ER, the DWPT load exhibits frequency content that could excite power system frequency dynamics. In this context, this work aims to study the spectrum of DWPT loads under different traffic conditions. Under simplifying assumptions, we develop statistical models to identify the location and relative magnitude of DWPT load harmonics. Our analysis reveals that the fundamental frequency depends on ER coil spacing and average EV speed. In the worst-case yet unlikely scenario that EVs move in a synchronized fashion, the amplitude of harmonics scales with the EV count. On the contrary, when EVs move freely, harmonics scale with the square root of the EV count. Platoon formations can accentuate harmonics. The spectral content around harmonics decreases in magnitude and increases in bandwidth with the harmonic index. The load of a single EV moving at a time-varying speed can be modeled as a frequency-modulated (FM) signal. Despite the simplifying assumptions, the derived models offer valuable insights for ER planners and grid operators. Dynamic simulations of a modified WECC model with DWPT loads synthesized from realistic EV trajectories and ER specifications corroborate some of these insights.
comment: 16 Pages, 18 figures
Lifted Schrödinger Bridges for Gaussian Mixture Endpoints: Projection Gaps and Path-Space Obstructions
We study stochastic density control between Gaussian-mixture endpoint distributions under Brownian prior dynamics. Since the direct Schrödinger bridge between Gaussian mixtures is generally not available in closed form, we introduce a lifted path-space construction in which each trajectory is augmented with a source--target component label. Consequently, the problem decomposes into Gaussian component-to-component Schrödinger bridges with explicit marginal, drift, and cost formulas, while the mixture-level assignment reduces to a finite-dimensional entropic coupling problem with a Sinkhorn scaling form. We then analyze the projection obtained by discarding or forgetting the label. By construction, the projected law satisfies the original Gaussian-mixture endpoint constraints, but its relative entropy generally differs from the lifted relative entropy by a nonnegative conditional label-information gap. This gap reveals a path-space obstruction: the lifted optimizer cannot, in general, be identified with the direct unlabeled Schrödinger bridge after projection. We also derive the posterior-averaged Markov drift associated with the projected marginal flow, prove a kinetic-energy upper bound, and identify a common path-potential condition under which the projection gap vanishes. Several numerical illustrations showing density and shape control are recorded for a self-contained exposition.
comment: 35 pages. Submitted to a journal; comments are welcome
Learning Developmental Scaffoldings to Guide Self-Organisation
From subcellular structures to entire organisms, many natural systems generate complex organisation through self-organisation: local interactions that collectively give rise to global structure without any blueprint of the outcome. Yet a significant portion of the information driving such processes is not produced by self-organisation itself, instead, it is often offloaded to initial conditions of the system. Biological development is a prime example, where maternal pre-patterns encode positional and symmetry-breaking information that scaffolds the self-organising process. From maternal morphogen gradients in early embryogenesis to tissue-level morphogenetic pre-patterns guiding organ formation, this transfer of information to initial conditions, analogous to a memory-compute trade-off in computational systems, is a fundamental part of developmental processes. In this work, we study this offloading phenomenon by introducing a model that jointly learns both the self-organisation rules and the pre-patterns, allowing their interplay to be varied and measured under controlled conditions: a Neural Cellular Automaton (NCA) paired with a learned coordinate-based pattern generator (SIREN), both trained simultaneously to generate a set of patterns. We provide information-theoretic analyses of how information is distributed between pre-patterns and the self-organising process, and show that jointly learning both components yields improvements in robustness, encoding capacity, and symmetry breaking over purely self-organising alternatives. Our analysis further suggests that effective pre-patterns do not simply approximate their targets; rather, they bias the developmental dynamics in ways that facilitate convergence, pointing to a non-trivial relationship between the structure of initial conditions and the dynamics of self-organisation.
comment: 8 pages + acknowledgements and references, 5 figures. Camera-ready version for ALife 2026
Posterior error bounds for prior-driven balancing in linear Gaussian inverse problems
In large-scale Bayesian inverse problems, it is often necessary to apply approximate forward models to reduce the cost of forward model evaluations, while controlling approximation quality. In the context of Bayesian inverse problems with linear forward models, Gaussian priors, and Gaussian noise, we use perturbation theory for inverses to bound the error in the approximate posterior mean and posterior covariance resulting from a linear approximate forward model. We then focus on the smoothing problem of inferring the initial condition of linear time-invariant dynamical systems, using finitely many partial state observations. For such problems, and for a specific model order reduction method based on balanced truncation, we show that the impulse response of a certain prior-driven system is closely related to the prior-preconditioned Hessian of the inverse problem. This reveals a novel connection between systems theory and inverse problems. We exploit this connection to prove the first a priori error bounds for system-theoretic model order reduction methods applied to smoothing problems. The bounds control the approximation error of the posterior mean and covariance in terms of the truncated Hankel singular values of the underlying system.
Data-Driven Active Power Flow Modeling: A Behavioral Systems Approach
The increasing decentralization of power systems driven by a large number of renewable energy sources poses challenges in power flow optimization: Partially unknown power line properties can render model-based approaches unsuitable. With the increasing deployment of sensors, data-driven methods rise as a promising alternative, offering flexibility to adapt changes and deal with unknown properties. In this paper, we propose a novel data-driven representation of nonlinear active power flow equations for radial grids based on Willems' Fundamental Lemma. Our approach allows for direct integration of input/output data into active power flow optimization, enabling cost minimization and constraint enforcement without requiring explicit knowledge of the electrical properties of the grid. Moreover, we derive a computationally tractable convex relaxation and show in a numerical case study that our approaches yield results that are identical to optimal active power flow formulations with known parameters.
ADAPT: An Autonomous Forklift for Construction Site Operation
Efficient material logistics play a critical role in controlling costs and schedules in the construction industry. However, manual material handling remains prone to inefficiencies, delays, and safety risks. Autonomous forklifts offer a promising solution to streamline on-site logistics, reducing reliance on human operators and mitigating labor shortages. This paper presents the development and evaluation of ADAPT (Autonomous Dynamic All-terrain Pallet Transporter), a fully autonomous off-road forklift designed for construction environments. Unlike structured warehouse settings, construction sites pose significant challenges, including dynamic obstacles, unstructured terrain, and varying weather conditions. To address these challenges, our system integrates AI-driven perception techniques with traditional approaches for decision making, planning, and control, enabling reliable operation in complex environments. We validate the system through extensive real-world testing, comparing its continuous performance against an experienced human operator across various weather conditions. Our findings demonstrate that autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics.
Design and Experimental Validation of Sensorless 4-Channel Bilateral Teleoperation for Low-Cost Manipulators
Teleoperation of low-cost manipulators is attracting increasing attention as a practical means of collecting demonstration data for imitation learning. However, most existing low-cost systems rely on unilateral position control without force feedback, while implementing force-feedback bilateral teleoperation is difficult because low-cost manipulators typically have low-resolution encoders and no joint torque sensors. This paper presents a sensorless 4-channel bilateral teleoperation framework that integrates identified nonlinear dynamics compensation with a disturbance-observer-based velocity and external-force estimation scheme. By interpreting the observer structure in the frequency domain, we clarify the coupling between the velocity- and external-force-estimation bandwidths and derive practical tuning guidelines based on the damping ratio and a single cutoff frequency. Real-robot experiments, including force-sensor comparison and teleoperation tasks, demonstrate that the proposed framework provides practically useful force estimates and enables stable teleoperation in high-speed and contact-rich scenarios under low-cost hardware constraints. As an application, imitation-learning experiments demonstrate that incorporating estimated force information into demonstrations improves task success rates in the tested contact-rich manipulation tasks.
comment: 22 pages, 12 figures, Submitted to IEEE Access
TRUST-UP: Trustworthy Reinforcement learning Using Safe Techniques for UAV Pursuit
Reinforcement Learning (RL) enables autonomous aerial vehicles to adapt quickly and make efficient decisions, making it well-suited for dynamic urban air mobility operations. However, the lack of safety guarantees and transparency hinders the airworthiness certification of RL-based flight control systems, particularly in low-altitude urban environments with human presence. This paper proposes a trustworthy reinforcement learning algorithm that utilizes safe techniques to address the AI trustworthiness requirements for aviation safety, ensuring the transparent and certifiable deployment of RL in safety-critical aerial operations. Specifically, we proposed a Trustworthy Reinforcement learning Using Safe Techniques for UAV Pursuit (TRUST-UP), which consists of two key components: a safety filter constructed from Control Barrier Functions (CBFs) that transforms unsafe RL actions into provably safe flight commands, and a switching strategy that enhances feasibility while maintaining operational transparency. These components enable trustworthy AI deployment in urban airspace, satisfying technical robustness and transparency requirements for aviation certification. Simulation results demonstrate that TRUST-UP enables autonomous UAVs to safely navigate congested urban environments while maintaining human-interpretable decision logic. This work contributes toward certifiable and explainable AI frameworks for low-altitude aviation, addressing the critical need for trustworthy autonomous flight systems in future urban air mobility.
comment: Accepted manuscript. Accepted for publication in Advanced Engineering Informatics
Hierarchical Distributed Architecture for the Least Allan Variance Atomic Timing
In this paper, we propose a hierarchical distributed timing architecture based on an ensemble of miniature atomic clocks. The goal is to ensure synchronized and accurate timing in a normal operating mode where Global Navigation Satellite System (GNSS) signals are available, as well as in an emergency operating mode during GNSS failures. At the lower level, the miniature atomic clocks employ a distributed control strategy that uses only local information to ensure synchronization in both modes. The resulting synchronized time or generated time scale has the best frequency stability, as measured by the Allan variance, over the short control period. In the upper layer, a supervisor controls the long-term behavior of the generated time scale. In the normal operating mode, the supervisor periodically anchors the generated time scale to the standard time based on GNSS signals, while in the emergency operating mode, it applies optimal floating control to reduce the divergence rate of the generated time scale, which is not observable from the measurable time difference between the miniature atomic clocks. This floating control aims to explicitly control the generated time scale to have the least Allan variance over the long control period. Finally, numerical examples are provided to show that the proposed architecture achieves an Allan variance on the order of $10^{-23}$ over averaging times ranging from one second to several days. This demonstrates its effectiveness and feasibility for high-precision, GNSS-resilient atomic timing.
Analytical PI Tuning for Second-Order Plants with Monotonic Response and Minimum Settling Time
This study presents two analytical closed-form PI controller tuning solutions for second-order plants with real poles, each achieving monotonic step response and minimum settling time. The first solution employs pole-zero cancellation, placing the controller zero at the slower plant pole and reducing the closed-loop dynamics to a critically damped second-order system. The second solution, applicable when the plant pole ratio is less than two, places all three closed-loop poles at a common location without cancelling any plant pole, yielding a closed-loop transfer function with a triple real pole and a zero. Despite retaining a closed-loop zero, this solution achieves strictly faster settling time than the pole-zero cancellation method in its region of applicability. The two solutions coincide at the boundary pole ratio of two and together form a continuous piecewise-analytical tuning covering the full range of plant pole ratios. This study further establishes that closed-loop transfer functions of the form a^n/(s + a)^n possess a maximum sensitivity Ms together with phase margin and gain margin that are independent of the pole location a and depend solely on the order n, yielding universal robustness constants for each n. A closed-form expression GM(n) = 1 + sec^n(pi/n) is established for the gain margin of the family. Numerical verification confirms the analytical results across multiple plant configurations.
comment: This submission has been withdrawn by the author; its results are subsumed by arXiv:2606.02868
Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning
For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.
comment: Project website: https://allisonandreyev.github.io/grasp.github.io/
Optimization via a Control-Centric Framework
Optimization plays a central role in intelligent systems and cyber-physical technologies, where speed and reliability of convergence directly impact performance. In control theory, optimization is the foundation of widely-used design methodologies such as linear quadratic regulation, $H_\infty$ control, and model predictive control. In contrast, this paper develops a control-centric framework for optimization itself, where algorithms are constructed directly from Lyapunov stability principles rather than being proposed first and analyzed afterward. A key element is the stationarity vector, which encodes first-order optimality conditions and enables Lyapunov-based convergence analysis. By pairing a Lyapunov function with a selectable decay law, we obtain continuous-time dynamics with guaranteed exponential, finite-time, fixed-time, or prescribed-time convergence. Within this framework, we introduce three feedback realizations of increasing restrictiveness: the Hessian-gradient, Newton, and gradient dynamics. Each realization shapes the decay of the stationarity vector to achieve the desired rate. These constructions unify unconstrained optimization, extend to constrained problems via Lyapunov-consistent primal-dual dynamics, and broaden results for minimax and generalized Nash equilibrium seeking problems beyond exponential stability. In total, the framework spans six problem classes and four convergence regimes, yielding a unified design recipe across twenty-four combinations, nine of which, to the best of our knowledge, have no direct continuous-time counterpart in the prior literature. The framework provides systematic design tools for optimization algorithms in control and game-theoretic problems.
comment: This work has been submitted to the IEEE for possible publication. 19 pages, 3 figures
Storage and Transport Capacity Design for a Self-Reliable Two-Node Stochastic Resource System
We study a two-node stochastic resource system operating over a finite horizon. Each node experiences uncertain supply and demand and is equipped with finite storage. The objective is to ensure that resource levels remain within prescribed limits with high probability. To this end, we formulate a chance-constrained capacity-design problem in which resources can be exchanged through a capacity-limited transport link. We characterize the minimum storage required at each node, derive the optimal transport policy, and quantify the trade-off between storage and transport capacities. Our results show the existence of a critical transport-capacity threshold that enables full risk pooling between the nodes. Moreover, this threshold decreases with the operating horizon, implying that full-pooling performance can be achieved with progressively smaller transport capacity over longer horizons.
comment: 9 pages, 4 figures
Nonlinear Network Identifiability with Full Excitations
We derive conditions for the identifiability of nonlinear networks characterized by additive dynamics at the level of the edges when all the nodes are excited. In contrast to linear systems, we show that the measurement of all sinks is necessary and sufficient for the identifiability of directed acyclic graphs, under the assumption that dynamics are described by twice continuously differentiable functions without constant terms (i.e., $f(0)=0$). But if constant terms are present, then the identifiability is impossible as soon as one node has more than one in-neighbor. In the case of general digraphs that may contain cycles, we consider additively separable functions for the analysis of the identifiability, and we show that the measurement of one node of all the sinks of the condensation digraph is necessary and sufficient. Several examples are added to illustrate the results.
comment: Accepted in IEEE Transactions on Automatic Control
The Iberian Blackout: A Black Swan or a Gray Rhino? A Protection-Aware Dynamic Voltage Security Assessment
On 28 April 2025, the Iberian mainland power system collapsed after a rapid voltage rise, widespread generation disconnections, and loss of synchronism. The ENTSO-E Expert Panel final report attributes the blackout to multiple interacting factors including ineffective voltage control, fixed power factor reactive behavior, fast generation ramps, protection settings not aligned with requirements, slow or unavailable reactive absorption, and limited observability outside the transmission system. This paper uses the incident as a motivating case for a broader operational voltage security problem: given the present grid state, can the next plausible trip, ramp, topology action, or shunt action push protected downstream voltages above relay thresholds before available voltage controls can respond? We develop a protection-aware dynamic voltage security assessment for this question. Starting from a nonlinear hybrid differential-algebraic equation (DAE) model, we derive mode wise finite window voltage maps that include automatic voltage regulators (AVRs), inverter-based resources (IBRs), static synchronous compensators (STATCOMs), high-voltage direct-current (HVDC) links, loads, shunts, transformers, protection functions, and limiter behavior whenever the corresponding models are available. We define normalized overvoltage margin erosion at the protection measurement side and time resolved lower bounds on useful control response. We then develop a monotone pickup cascade screen, robust data-limited certificates under uncertain relay and protected-voltage data, and a mitigation optimization that computes the minimum fast reactive action needed to keep protected voltages below relay thresholds. Case studies on a 2000-bus mechanism replica and multiple dynamic benchmark systems show that the screen predicts nonlinear cascade propagation.
Robotics
Mana: Dexterous Manipulation of Articulated Tools
Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.
comment: Project Page: https://zhaohengyin.github.io/mana
Improving Robotic Generalist Policies via Flow Reversal Steering
Generalist policies can learn a wide range of skills from diverse robot datasets. In order to solve or improve on challenging news tasks, we need a way to infer and invoke the appropriate actions from the policy's rich behavioral prior, especially when directly commanding the policy fails. We focus on flow matching generalists and propose Flow Reversal Steering (FRS): a method that takes suboptimal but ``reasonable'' actions, finds their latent noises by passing them through the flow policy in reverse, and maps them to nearby generalist action modes. We evaluate FRS across many simulated and real-world manipulation settings. First, FRS can turn coarse semantic guidance from humans or vision-language models (VLMs) into corresponding good robot actions, improving zero-shot control. These gains can be distilled with behavioral cloning by training an auxiliary policy to output noises that the generalist maps to good actions -- showing up to 95% absolute task success rate boosts in under a minute of training. Finally, FRS enables policy improvement by bootstrapping reinforcement learning with semantic knowledge, improving on several tasks that standard RL fails to improve on.
$\texttt{WEAVER}$, Better, Faster, Longer: An Effective World Model for Robotic Manipulation
The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: $\textit{(i)}$ fidelity (i.e., producing simulated trajectories that correlate with reality), $\textit{(ii)}$ consistency (i.e., producing simulated trajectories that are coherent over long horizons), and $\textit{(iii)}$ efficiency (i.e., producing simulated trajectories quickly). We propose $\texttt{WEAVER}$ (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. $\texttt{WEAVER}$ is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply $\texttt{WEAVER}$ in robotic hardware, demonstrating its effectiveness at policy evaluation ($ρ$=0.870 correlation with real-world success rate), policy improvement (real-world success rate improvement of $38\%$ on top of the $π_{0.5}$ robot foundation model), and test-time planning (real-world success rate improvement of $14\%$ with a $5-10\times$ speedup over prior WMs). $\texttt{WEAVER}$ also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/ .
MCR-Bionic Hand: Anatomical Structural Priors for Dexterous Manipulation
Dexterous robotic hands are usually formulated as high dimensional active control systems governed by degrees of freedom, actuation, and algorithms. Human hand dexterity, however, is partly encoded in the physical architecture of bones, ligaments, tendons, aponeuroses, and intrinsic muscles. This work describes that contribution as two linked forms of structural intelligence: structural prior generation, in which wrist to finger tenodesis, FDS/FDP routing, and the dorsal extensor hood transform low dimensional posture inputs into default grasp configurations and PIP to DIP coordination; and muscle mediated modulation, in which extrinsic muscles, lumbricals, and interossei regulate MCP posture, distal stability, fingertip force paths, and contact states around that default state. Based on this framework, MCR-Bionic Hand is developed as a 1:1 musculoskeletal biomimetic hand integrating a two row eight bone wrist, cross wrist tendons, anatomical flexor routing, volar plate and collateral ligament constraints, the dorsal extensor hood, and intrinsic muscle pathways within one body. Functional demonstrations and geometric mechanical models show that wrist posture induces multi joint pre shaping, the extensor hood maps PIP posture to a coupled DIP response, and intrinsic plus pathways modulate distal stability and fingertip action direction after grasp formation. Contact rich tasks, including coin rotation, pen transfer, dorsal coin flipping, and cube manipulation, show that MCR-Bionic links low dimensional state generation with fine post contact modulation. These results suggest that anatomical biomimetics is valuable not for visual similarity, but for identifying human hand structures that perform part of control.
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.
comment: Work in progress. Project website at https://zjunlp.github.io/LabVLA/
MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models
World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.
Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments
Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensors provide detailed 3D information about the environment and are invariant to lighting conditions. For this reason, LiDAR-based place recognition methods have gained significant attention. In this paper, we propose MinkUNeXt-VINE++, a novel approach that combines early fusion of heterogeneous LiDAR data from two sensors (Livox Mid-360 and Velodyne VLP-16) and a learned re-ranking strategy in inference time. This fusion leverages the strengths of each sensor to provide a more comprehensive representation of the environment. Additionally, the re-ranking approach is particularly important in repetitive environments, such as vineyards, as finding true positives is a major challenge. We evaluated our approach using the TEMPO-VINE dataset, which provides heterogeneous LiDAR data in vineyard environments across different phenological stages. Our results demonstrate that MinkUNeXt-VINE++ significantly improves place recognition performance compared to single-sensor approaches and state-of-the-art methods. MinkUNeXt-VINE++ achieves a 20% improvement in the Recall@1 metric compared to single-sensor approaches, and +30% including re-ranking. The code of our method is publicly available for reproduction.
SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale
This work introduces Spatial Annotations from Robot Demonstrations with Reliability Calibration (SPARC), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations, such as bounding boxes, object trajectories, and manipulation phase labels, benefit a broad range of robotics applications from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition. Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels or discarding useful samples. In contrast to existing automated pipelines, SPARC leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, reducing noisy labels and retaining more useful samples. We further introduce Interaction-Aware Bench (IA-Bench), a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in localization accuracy while retaining three times more samples at high-precision operating points. Our experiments demonstrate that models finetuned on our annotations achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models, while remaining competitive on broader spatial-reasoning suites without manually verified or annotated training data. Furthermore, policies trained on SPARC-generated annotations outperform baselines in cluttered, visually ambiguous real-world scenes. Code, data, and models are available at intuitive-robots.github.io/sparc-labeling.
NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation
Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: https://dachii-azm.github.io/navwam/
comment: Project page: https://dachii-azm.github.io/navwam/
GIVE: Grounding Human Gestures in Vision-Language-Action Models
Human communication is inherently multimodal, where language is often accompanied by non-verbal cues such as gestures to convey intentions. However, current Vision-Language-Action (VLA) models treat robotic manipulation as a pure text-driven task, overlooking the important role of gestures in Human-Robot Interaction (HRI). This often leads to inaccurate intent grounding and unreliable manipulation when language instructions are ambiguous or underspecified. To address this challenge, we propose GIVE (Gesture Intent via Visual-Semantic Enhancement), an effective approach that enhances pre-trained VLA models with human gesture understanding without architectural modifications. Specifically, GIVE incorporates gesture information through two complementary pathways: a visual pathway that overlays hand skeletons and fingertip rays onto robot observations for explicit object grounding, and a semantic pathway that generates high-level descriptions of human gestures and task instructions for robust intent grounding. By jointly leveraging visual and semantic guidance, GIVE enables VLA policies to better associate gestures with manipulation behaviors and adapt to dynamic interaction intents. In real-world HRI experiments, GIVE substantially outperforms the baseline, improving target object recognition accuracy by 40% and overall task success rate by 80%, while demonstrating strong robustness and generalization to unseen spatial layouts and diverse participants.
comment: Project page: https://luis-cloud-sg.github.io/GIVE-project/
PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update ICML 2026
While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available on https://github.com/MJianM/PolyFlow.
comment: 30 pages, 12 figures, Accepted to ICML 2026
GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation
Whole-body mobile manipulation requires coordinating mobile base and manipulator under shifting viewpoints, posing challenges in geometric perception and action generation. Current policies either rely on 2D features or sparse 3D representations that lack dense spatial structure, and typically encode arm and base within one action vector that ignores their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations under noisy depth while incurring heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark demonstrate that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Furthermore, real-world experiments on diverse tasks also confirm consistent improvements over all baselines.
Real-Time Execution with Autoregressive Policies
Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds from synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.
Low cost, easily manufactured, highly flexible strain and touch sensitive fiber for robotics applications
Existing stretch and touch sensors for robots are generally expensive with respect to at least one of material costs, required manufacturing equipment, or manufacturing time. We present and experimentally characterize a conductive fiber made using only inexpensive commercial off-the-shelf parts (conductive thread at $0.07/ft, silicone tubing at $0.94/ft) and tools (loop-style needle threader at $2), which can be manufactured quickly (20 cm length in 2 minutes.) We demonstrate its use as a resistive strain sensor with three applications: Triggering a grasp in a pneumatically actuated assistive finger, sensing the pose of a pneumatically actuated robotic strap, and estimating the pose of a flexible solid. We also demonstrate that it can be used as a capacitive sensor with two applications: First, as a touch sensor which triggers a commercial robot arm to move, and second, as a near-field sensor enabling the robot arm to follow a moving hand. The capacitive sensors are knitted, showcasing the high flexibility of the fiber. We discuss methods for improving manufacturing scalability and their cost trade-offs. Finally, we demonstrate a method for repairing a cut fiber.
EMG-Based Adaptation of Anisotropic Virtual Fixtures for Robot-Assisted Surgical Resection and Dissection
In this paper, we address the development of an adaptive assistance system for robot-assisted laparoscopic surgery, specifically for delicate tasks such as Resection and Dissection. Even if Virtual Fixtures offer significant advantages for guiding a surgeon's movements, conventional Virtual Fixtures are often defined by fixed geometries, lacking the flexibility to adapt to the surgical workflow or the surgeon's immediate intent. To address these limitations, we propose a novel framework for an adaptive and anisotropic virtual fixture. In addition, we introduce an intuitive control interface that modulates the fixture's geometry in real-time based on the surgeon's intent, inferred from EMG signals. This approach allows the surgeon to dynamically expand or disengage the constraint by contracting their forearm muscles, enabling seamless transitions between precise guided motion and free repositioning of the tool. Experimental results from a pilot user study, based on a standardized surgical training task, demonstrate the effectiveness of the proposed method. The system showed significant improvements in task accuracy and movement consistency, alongside a reduction in perceived cognitive load, effort, and frustration.
See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation
In bimanual robotic manipulation, task-relevant visual information varies with the task stage and context, while the interaction of the two arms shifts between independent and coordinated modes, making policy learning challenging. However, existing monolithic Vision-Language-Action (VLA) policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway, often failing to separately account for visual relevance and bimanual interaction structure. To address this issue, we propose a bimanual manipulation VLA framework based on Dual-Level Structural Decomposition. The View-Selective Visual Router dynamically adjusts wrist-view contributions to emphasize relevant visual cues, while the Interaction-Aware Action Mixture-of-Experts (MoE) decomposes action generation into coordinated and arm-wise pathways to adapt to varying bimanual interaction modes. We evaluate the proposed method on six simulated bimanual manipulation tasks in RoboTwin 2.0 and three long-horizon real-world tasks. Our model improves the overall average success rate over a monolithic baseline by 27.7% in simulation and 43.3% in real-world evaluation, while consistently outperforming single-module variants across both settings. These results demonstrate that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.
Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes
Humor plays a central role in human social relationships, and recent advances in computational humor create new opportunities for integrating humor into human-robot interaction (HRI). While large language models (LLMs) can generate diverse forms of humor, it remains unclear how humor style, joke content, and language preference shape perceptions of robot-delivered humor in group settings. In this exploratory study, we employed a mixed factorial design in which participants evaluated AI-generated jokes delivered by a robot in a university classroom. We examined the effects of humor type (Affiliative, Self-Enhancing, Aggressive, Self-Defeating) and joke content (person-related vs. political) on perceived funniness and appropriateness, as well as preferred language. Results show that humor type significantly influences funniness, with Aggressive and Affiliative humor rated higher, while joke content primarily affects appropriateness, with person-related jokes preferred over political ones. Language preference was shaped by both joke content and participants' self-reported fluency and humor practices.
comment: Accepted in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026), Kitakyushu, Fukuoka, Japan
WT-UMI: Tactile-based Whole-Body Manipulation via Force-Supervised Contact-Aware Planning
Whole-body humanoid manipulation of bulky, deformable, and shared-load objects requires distributed contact sensing and explicit force regulation, yet most imitation policies treat contact force only implicitly. On the other hand, different demonstration sources provide complementary modalities with inherent trade-offs: human demonstrations capture natural contact forces but not robot-executable actions, while teleoperation directly records robot actions but with less natural force regulation. This paper presents \textbf{WT-UMI}, a wearable whole-body tactile interface worn by human operators or mounted on humanoids, providing accurate observations of tactile images, contact forces, and end-effector poses across both human demonstration and humanoid teleoperation modes. We introduce a force-conditioned target-pose correction module that converts measured human poses into contact-aware robot targets by learning corrections from teleoperation data. To leverage the natural force interaction in human data, we propose a force-supervised planner that predicts end-effector pose chunks and contact-force trajectories. The predicted contact force serves as the reference for a tactile-based admittance controller. Across five contact-rich tasks spanning deformable objects, bulky rigid objects, and human--humanoid collaboration, WT-UMI improves success rate and reduces contact-position tracking error over four policy baselines. Our project page is available at https://wt-umi.github.io/WTUMI/.
comment: 18 pages, 8 figures
Proprioceptive-visual correspondence enables self-other distinction in humanoid robots
Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with humans still lack this ability. Here we show that a humanoid robot can learn self-other distinction from proprioceptive-visual correspondence, without any identity labels or kinematic models. Once established, this distinction bootstraps a predictive self-model that maps joint configurations to three-dimensional body occupancy, capturing how the robot's body changes with action. In multi-agent scenes involving humans or morphologically identical robots, the system reliably identifies itself, learns a 3D self-model, and supports downstream tasks including target reaching, collision-aware motion planning, and human-to-robot motion retargeting. Together, these results outline a route toward bodily self-representation in robots that act and coordinate alongside others in shared physical environments. Project page: https://euron-zc.github.io/humanoid-self-model/.
comment: 23 pages, 9 figures, 1 supplementary table
Visual Place Recognition in Forests with Depth-Aware Distillation ICRA
Visual place recognition in natural forest environments remains challenging due to repetitive vegetation, weak structural cues, and significant appearance variation across traversals. To address this limitation, this paper proposes a lightweight depth-aware distillation framework that injects geometric cues into a DINOv2-based place recognition model, while maintaining its pre-trained descriptor space. Evaluated on the recent WildCross benchmark, the proposed approach yields gains over an appearance-only counterpart, providing robustness to appearance variations. These results demonstrate the importance of depth as a strong complementary modality for place recognition in natural environments and identify depth-aware distillation as a promising direction for more robust forest perception.
comment: IEEE ICRA Workshop on Field Robotics 2026
Embedding ISO 10218 Safety Compliance in Robots via Control Barrier Functions for Human-Robot Collaboration
Human-Robot Collaboration (HRC) requires strict adherence to safety standards, such as ISO 10218, to prevent harmful interactions. Standard Speed and Separation Monitoring (SSM) filters calculate safe robotic speeds based on conservative assumptions, such as constant human velocity, which prevents accurate predictions of minimum separation distances and causes unnecessary operational halts. This paper proposes a Control Barrier Function (CBF) that explicitly incorporates human acceleration data to analytically forward-predict the minimum human-robot separation distance during a worst-case robotic stopping trajectory. To guarantee safety at the control level, this predictive CBF is integrated as an inequality constraint within a Sequential Quadratic Programming (SQP) framework. Specifically, two methods are proposed: Method I, a CBF-constrained PD safety filter; and Method II, a task-scaling SQP controller that enforces a spatial tube constraint. Simulated and real-world experiments on a UR10e robot evaluate the two proposed methods against a standard industrial SSM module baseline. Results demonstrate that Method II dynamically modulates execution speed and confines spatial deviations. Compared to Method I, Method II achieves a 63\% reduction in mean trajectory error and avoids excessive evasive manoeuvres, ensuring high task throughput while complying with ISO 10218 SSM guidelines.
Multi-Modal Multi-Agent Robotic Cognitive Alignment enabled by Non-Invasive Consumer Brain Computer Interfaces: A Proof of Concept Exploration
While non-verbal behaviors and expressive movements are essential for natural human-robot interaction, existing methods often overlook a crucial element: the human's internal cognitive state. Frequently, proactive multi-agent systems can interrupt humans at inopportune moments, leading to cognitive overload and decreased task performance. This paper introduces a framework for generating "cognitively aligned" multi-agent interactions, enhancing the ability of robotic systems to contextually defer communications to the user of an agent system during moments of high human mental workload and engagement. We present the design and implementation of a closed-loop architecture that explores the interplay between autonomous task execution and real-time neurophysiological focus. Using a consumer-grade Brain-Computer Interface (BCI), our approach continuously monitors Electroencephalography (EEG) spectral band powers while a human performs an engagement-inducing task. We propose an engagement-driven pipeline where an HTTP-based signaling mechanism places a primary agent's sensory inputs and audio outputs into a holding state upon detecting high engagement. This allows secondary agents to seamlessly process complex, delegated tasks in the background. Once the human's cognitive state returns to a lower cognitive load baseline, the primary agent releases the queued agent message. Our preliminary results demonstrate the feasibility of leveraging real-time signal processing, Large Language Models (LLMs), and physical robotic embodiments to create cognitively-aware, non-intrusive multi-agent systems.
comment: 19 pages, 9 figures, for associated video, see https://youtu.be/0Tav-G87XGs
Redesigning Regularization for Effective Policy Smoothing
This paper proposes a novel regularization design to effectively smooth policy functions in reinforcement learning. While regularization that enhances ``global'' Lipschitz continuity was initially considered, it has been limited to ``local'' Lipschitz continuity due to a tradeoff between smoothness and expressiveness. However, it has become apparent that the original implementation is cumbersome and does not provide sufficient smoothing, leading to a preference for simpler implementations. This stems from a discrepancy between theory and implementation, and a more appropriate implementation can expect to facilitate smoothing. Therefore, this paper identifies three reasons why the original implementation does not function adequately and provide remedies for them. This modified regularization performs well across multiple tasks and algorithms, successfully achieving smooth motion while improving control performance. Furthermore, by applying it to sim-to-real reinforcement learning for a quadruped robot, it is demonstrated that smooth motion provides robustness against sudden changes in target velocity commands.
comment: submitted to RA-L
MPC for underactuated spacecraft control with a Lyapunov supervised physics-informed neural network correction layer SP
Underactuated spacecraft faces controllability limitations and heightened sensitivity to environmental disturbances, complicating attitude maneuvering and stabilization. Due to the lack of control authority along the underactuated axis, conventional controllers cannot directly stabilize all attitude components and therefore require reference planning strategies. Furthermore, MPC approaches remain sensitive to inertia uncertainty and unmodeled dynamic couplings, resulting in degraded tracking performance under mismatch. To address these issues, we consider a hierarchical architecture integrating three layers: (i) a nonlinear model predictive controller (NMPC) for constraint and underactuation-aware maneuver planning and nominal closed-loop stability under actuator limits; (ii) a physics-informed neural network (PINN) trained offline on simulation data to estimate residual disturbance torques, with loss terms that enforce consistency with rigid-body rotational dynamics; (iii) a Lyapunov-based supervisory safety mechanism that evaluates the learned correction online and bounds or suppresses its influence to preserve the stability properties of the baseline controller. The architecture is evaluated in a high-fidelity simulation environment modelling reaction wheel dynamics, actuator saturation, and environmental disturbances. Monte Carlo studies show statistically significant reductions in steady-state attitude error relative to standalone NMPC while maintaining robust behavior under uncertainty. The supervisory layer ensures graceful degradation to purely model-based control when the learning-based augmentation is unreliable.
comment: Accepted at SPAICE (AI in and for Space) 2026
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation
Despite the success of vision-based generalist robotic policies, existing tactile-based policies remain tied to fixed embodiments and sensor setups. This is because tactile signals are highly heterogeneous across hardware, making cross-sensor generalization difficult. We present FTP-1,the first generalist foundation tactile policy pretrained to acquire transferable tactile manipulation abilities across diverse sensors and embodiments. FTP-1 supports varied tactile inputs, including image-, array-, and state-based signals, by using heterogeneous encoders to project them into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretrained on around 3,000 hours of tactile manipulation data aggregated from 26 data sources, spanning human and robot demonstrations across 21 sensors, FTP-1 learns tactile skills that transfer beyond the sensors seen during pretraining. Across downstream finetuning experiments spanning 5 hardware configurations, FTP-1 improves contact-rich manipulation on seen sensor setups by +17.2% and, surprisingly, transfers to two previously unseen tactile-sensor setups, achieving a +31% gain in success rate. FTP-1 establishes the first unified foundation baseline for tactile manipulation, providing future tactile policies with a shared model-level starting point. Pretrained models, datasets, training code and more visualization at https://ftp1-policy.github.io.
Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models
Scale buys interpolation; structure buys a certified horizon. A world model's average error says nothing about whether a particular prediction can be trusted, or for how long. For equivariant latent world models we give a computable, multi-step certificate of the predictable horizon: $T$-step rollout error is provably constant over each symmetry orbit (Theorem A) and stratified channel-by-channel by the predictor's Lyapunov spectrum, $T_j(ε)\sim\log(1/ε)/λ_j$. The horizon is two-sided -- a matching lower bound makes approximate equivariance provably horizon-limited -- and the certificate is exclusive to structure: orbit-constant error characterizes equivariance, so no non-equivariant model has it at any scale. Empirically, on 40-D Lorenz-96 only a $\mathbb{Z}_N$-equivariant network recovers the full Lyapunov spectrum ($R^2{=}0.98$); dense and recurrent baselines fail. Because the spectrum is faithful, the certificate acts, a priori: under a fixed sensing budget a $c\times$-inflated certificate provably needs $c\times$ the budget, and the equivariant certificate meets a budget its inflated dense counterpart cannot -- with zero calibration data. The same read-out, unchanged, audits public pretrained world models training-free: TD-MPC2 checkpoints land on the certificate's own scope taxonomy -- calibrated where strongly expansive (ratio 0.94-1.02), optimistic where weakly expansive, correctly abstaining where contracting -- a map a deployed monitor replicates cell-by-cell, out-of-sample. Across the official 1M-317M multitask ladder, calibration does not improve with parameters. On V-JEPA 2-AC (1B, real robot data) the measured cross-check correctly overrides an over-promising tangent spectrum -- the cross-validated audit, not the raw number, is the deployable object. Scale buys interpolation, not a calibrated horizon.
comment: 23 pages (9 main + appendices). Code: https://github.com/TimothyWang418/se3-ejepa
Effects of Social Interactions in Self-Organising Railway Traffic Management
Recent research is exploring self-organised traffic management as a solution for scaling to complex real-world networks. In such a system, trains predict their neighbourhood, produce traffic plan hypotheses, and agree via consensus with neighbours on a future traffic plan to be implemented. This paper investigates a structural parameter within this pipeline: the predictive neighbourhood horizon. The horizon is used by trains to identify future potential conflicts with neighbours, and to establish the local interaction topology, that is, the subset of trains to negotiate with. As the primary design variable, the horizon directly determines the size and density of the social interaction graph, whereas its impact on the complexity of local sub-problems and the distributed consensus dynamics represents a trade-off to be explored. Through a closed-loop simulation framework the study evaluates how variations of the horizon impact the overall decentralised coordination process, from initial conflict detection to distributed schedule consensus. The analysis focuses on investigating the potential trade-off introduced by the horizon choice: balancing local tractability and computational responsiveness with the need for global schedule coherence and feasibility in safety-critical environments. Contrary to intuition, our empirical results indicate that the short time horizons suffice, while long values compromise local tractability and computational responsiveness with no gain in global schedule optimality.
EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation
Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant events. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce EA-WM, an event-aware world-model framework that augments frozen visual-feature dynamics with task-specification-grounded event prediction and verification. EA-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPOgenerated proposals. Across navigation, deformable-object, wall-constrained, and languagedescribed manipulation studies, EA-WM shows that event-aware verification can make featurespace world models more interpretable and better aligned with task progress.
Y-BotFrame: An Extensible Embodied Agent Framework for Quadruped Robot Assistants
Quadruped robots are capable of traversing a wide range of complex terrains with high flexibility. As highly mobile ground-based intelligent platforms, they can be equipped with modules for navigation control, environmental perception, and intelligent interaction, thereby serving as real-world mobile deployment platforms for various algorithms. In this paper, we introduce Y-BotFrame, an extensible embodied platform that turns a robot into an intelligent ground assistant. Y-BotFrame integrates multimodal perception capabilities, including speech, vision, and LiDAR, and employs a large language model as the cognitive core for environmental understanding, contextual reasoning, and task planning. The system maps user natural-language instructions into executable embodied task units that can be carried out by the robot. Y-BotFrame supports natural interaction through voice commands and visual feedback, removing the need for a remote controller and enabling efficient human-robot collaboration. With a highly extensible framework, Y-BotFrame supports plug-and-play integration of new functional modules as well as modular upgrades and iterative development, offering a reference implementation for the real-world deployment of general-purpose, instruction-driven embodied agents.The supplementary video is available at https://xdei-group.github.io/Y-BotFrame/.
RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation
Vision-language models (VLMs) are increasingly explored as visual critics, reward generators, and failure detectors in robotic manipulation. These roles implicitly require models to judge not only final task success, but also how a manipulation execution is physically and temporally progressing. However, existing evaluations fail to test whether VLMs possess fine-grained process understanding. To address this gap, we present RoboProcessBench, a benchmark for process-aware understanding in vision-language robotic manipulation. RoboProcessBench decomposes such capability into two complementary dimensions, \emph{static monitoring} and \emph{dynamic reasoning}, instantiated as 12 diagnostic question families covering phase, contact, motion, coordination, primitive-local progress, temporal order, outcome, and primitive-level transitions. Built from physically grounded execution traces, the curated benchmark corpus ProcessData contains \textasciitilde 58k question-answer pairs across 260 manipulation tasks, which is further split into ProcessData-SFT and ProcessData-Eval for post-training and evaluation purposes. Extensive evaluation of various VLMs on ProcessData-Eval reveals broad limitations across 12 diagnostic task families, suggesting current models still lack robust process-aware understanding of manipulation executions. But with ProcessData-SFT, the post-trained \textit{Qwen2.5-VL-7B} and \textit{InternVL-3-8B} exhibit consistent gains on local state, motion, progress, and primitive-aware cues. These results demonstrate that RoboProcessBench serves as both an evaluation benchmark and a learnable supervision source for developing VLMs capable of monitoring and evaluating robotic manipulation processes. Project webpage: \href{https://processbench-2026.github.io/RoboProcessBench-Web/}{https://processbench-2026.github.io}.
Comparing Commercial Depth Sensor Accuracy for Medical Applications
Depth estimation has numerous medical and surgical applications. We benchmark four depth sensors on a porcine bone specimen, a porcine belly specimen, and a silicone kidney phantom using stylus-sampled references. These objects contain several real-world challenges, including homogeneous surfaces, specular surfaces, and subsurface scattering. The comparison includes stereo, structured-light, and time-of-flight sensors at a distance of approximately 50 cm. Specifically, the Intel RealSense D405 (Intel RealSense, United States), PMD Flexx2 (pmdtechnologies, Germany), Stereolabs ZED 2i (Stereolabs, France), and Zivid 2M+ 60 (Zivid, Norway) are compared. The Zivid 2M+ 60 performed best across all objects and metrics considered in this work. The ZED ranked second for real tissue, but last on the phantom.
comment: 4 Pages
GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training
Humanoid-Object Interaction (HOI) is a fundamental capability for humanoid robots, yet it remains challenging due to the tight coupling between dynamic balance and stable interaction with diverse objects. Existing methods often require time-consuming task-specific policy training or rely on rigid trajectory replay, which limits their ability to accommodate novel interaction scenarios. In this work, we present \textit{GenHOI}, a simple yet effective framework that enables humanoid robots to perform diverse object-interaction tasks in a zero-shot manner by directly imitating a single generated video, without task-specific training or physical demonstration data. GenHOI first reconstructs the robot-object scene in simulation and renders a first-frame image, which, together with the language command, conditions the synthesis of a task-oriented interaction video. The generated video is then analyzed to identify interaction-relevant contact events and estimate hand-object contact regions, which are encoded as object-centric geometric constraints that convert visual interaction cues into physically grounded optimization priors. Guided by these priors, the reference motion recovered from the video is refined and smoothed to resolve the scale ambiguity inherent in 2D video generation, while adapting a single reference trajectory to unseen robot-object relative poses. The optimized trajectory is finally executed by a closed-loop tracking controller. We validate the proposed framework in extensive simulation and real-world experiments across diverse object-interaction tasks, including box grasping, asymmetric bimanual chair carrying, table lifting from below, and cylindrical-object enveloping.
Diffusion Transformer World-Action Model for AV Scene Prediction
Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $ρ= 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.
comment: 10 pages, 9 figures, 2 tables
Trajectory-Level Redirection Attacks on Vision-Language-Action Models
Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/
EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment
Scalable robot imitation learning relies on large-scale heterogeneous data from diverse robots or body-free data, making Cartesian end-effector actions a key interface for embodiment-agnostic policy learning. However, end-effector-only abstraction leaves Cartesian policies unaware of the deployed robot body, making them brittle under robot-specific constraints such as whole-body collision avoidance. To overcome this limitation, we present EmbodiSteer, a training-free framework that steers embodiment-agnostic visuomotor policies toward zero-shot, embodiment-aware deployment. EmbodiSteer keeps policy learning in Cartesian space while efficiently lifting inference-time diffusion sampling into the target robot's joint space via forward kinematics and Jacobian-based updates. With whole-body collision-aware guidance over joint trajectories after each denoising step, the arm can be steered away from collisions while preserving learned end-effector behavior. Compared with Cartesian-only execution, EmbodiSteer reduces collision rate by 46.1% and improves task success rate by 28.5% across 9 simulated robots, and further achieves 90.0% collision rate reduction and 36.7% success rate increase on two physical robots in highly constrained scenarios. Our project page is at https://frankwang67.github.io/EmbodiSteer-Page.
comment: The first two authors contribute equally
SERF: Spatiotemporal Environment and Robot Feature Map for Long-Horizon Mobile Manipulation
Long-horizon robot mobile manipulation requires continual reasoning about localization, environment changes, and task progress, all of which are challenging to infer from image observations alone. In this paper, we show that conditioning a mobile manipulation policy on a spatiotemporal feature map improves reasoning over long horizons. The map represents the environment and the articulated robot body as neural points in a shared latent space and is updated online from egocentric observations and proprioceptive state. We update the environment neural points using object-level rigid tracking and the robot neural points using forward kinematics. We use our spatiotemporal environment and robot feature (SERF) map as a state input to a vision-language-action (VLA) model by extracting map tokens from multiple reference frames and spatial scales, providing the policy with both local and global context. We demonstrate SERF on BEHAVIOR-1K, a benchmark for long-horizon mobile manipulation in household environments. Experiments show that the SERF VLA policy outperforms image-only baselines, reaches subgoals faster by following more direct trajectories, improves robustness to scene-configuration shifts, and recovers from object-drop failures.
comment: Project page: https://existentialrobotics.org/serf/
Towards Reliable Sequential Object Picking in Clutter: The Runner-up Solution to RGMC 2025
As a long-standing challenge in robotic manipulation, stable and efficient grasping in cluttered environments is of great importance in industrial settings. While recent studies have achieved relatively high success rates in grasping from clutter, there remain few mature solutions for more demanding tasks such as sequential object search and sorting. This work addresses sequential object picking in cluttered environments based on the Cluttered Environment Picking Benchmark (CEPB) and presents our solution to the Pick-in-Clutter track of the 10th Robotic Grasping and Manipulation Competition (RGMC) at ICRA 2025. The task poses several key challenges. First, it requires robust and collision-aware grasping with high success rates across a diverse set of objects, including both rigid and deformable ones. Second, it demands efficient search for target objects, which places stringent requirements on the decluttering and searching strategies of the solution. To address the above challenges, we design an integrated hardware-software pipeline that combines object recognition, decluttering, and multi-modal grasping. The main contributions include the hardware design of a multifunctional gripper and novel representations for object distribution and occlusion relationships in cluttered space. This pipeline enables efficient recognition, search, and sequential grasping of objects in clutter, demonstrating strong performance in both laboratory tests and competition scenarios, and ultimately achieving second place in the Pick-in-Clutter track of the RGMC 2025.
comment: First, Second and Third Coauthor contributed equally to this work
An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics
Wet-lab robots can improve the reproducibility, throughput, and safety of biomedical experiments, but scaling their learning requires customizable simulators for safe and reproducible task generation, open editable laboratory assets, and efficient pipelines that turn limited demonstrations into usable training data. We present Pipette, an embodied simulation platform, benchmark, and data-efficient augmentation framework for wet-lab robot learning. Pipette releases over 43 open-source and re-editable wet-lab assets, together with an extensible asset-building pipeline. A key component of Pipette is its simulation-based data augmentation pipeline, replaying human demonstrations in simulation, applies lighting, camera, speed, and action perturbations, and filters generated episodes with automatic task success checks, rapidly expanding usable training data from limited manual demonstrations. We further introduce an 11-task wet-lab embodied benchmark covering sample handling, culture-ware manipulation, device operation, and precision placement. With only 30 demonstrations per task, ACT achieves 65.5% average success rate, while simulation augmentation improves SmolVLA from 44.1% to 74.7% and π0 from 40.4% to 46.5%, validating the effectiveness of Pipette for data-efficient VLA training and evaluation. Pipette also supports natural-language-driven scene construction and task registration, lowering the barrier for non-expert users to define new wet-lab robotic tasks.
comment: 25 pages, 17figures
Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning
For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.
comment: Project website: https://allisonandreyev.github.io/grasp.github.io/
Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer
Reinforcement learning has achieved remarkable success in learning complex control policies, yet its applicability remains limited due to sample inefficiency and poor generalization across tasks. In this work, we propose RepMT-SAC, a framework for multi-task RL that enables efficient knowledge sharing and robust transfer to new tasks. RepMT-SAC uses spectral MDP decomposition to capture transferable dynamics, structuring the value function into a task-agnostic core with a minimal task-specific adjustment. This design allows for strong zero-shot performance on in-distribution tasks and rapid few-shot adaptation to out-of-distribution tasks. We evaluate RepMT-SAC on quadcopter trajectory-following tasks across in-distribution and out-of-distribution contexts, demonstrating that it outperforms baselines by up to 30%.
comment: 8 pages, 4 figures, 1 table
AIR-VLA+: Decoupling Movement and Manipulation via Cascaded Dual-Action Decoders with Asymmetric MoE for Aerial Robots
Aerial manipulation systems have long suffered from representation coupling in end-to-end control, as platform-level Unmanned Aerial Vehicle (UAV) movement and end-effector-level arm manipulation differ substantially in action scale, dynamics, and control objectives. In this paper, we propose AIR-VLA+, a flow matching action generation architecture specifically designed for aerial manipulation, featuring cascaded dual-action decoders and an asymmetric feature-level Mixture of Experts (MoE). We construct cascaded manipulation and movement decoders, allowing the UAV to unidirectionally observe the manipulator's intent during movement to achieve workflow coordination, while isolating the impact of UAV movement information backpropagation on arm manipulation stability. Addressing the characteristic that UAV movement is highly dependent on high-level semantics and responsible for task state transitions in aerial manipulation, we design an input feature enhancement module for the UAV movement decoder. This module introduces an implicit visual grasp projector to perceive the interaction state between the gripper and the object, and injects compressed global semantic features. Within the UAV movement decoder, we deploy an implicit MoE architecture, enabling different movement experts to spontaneously exhibit capacity inclinations for various task stages during training. Through dense soft blending computation on the feature manifold, the UAV movement is endowed with stronger task-stage adaptability. Experiments on the standardized AIR-VLA benchmark demonstrate that our method comprehensively surpasses all baselines with an overall average score of 48.0. The overall task completion score improves by 80.2\% compared to the single-head $π_{0.5}$ policy, effectively mitigating the heterogeneous coordinated control conflicts of composite robots.
SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture
Semantic mapping is a core service that enables grounded interactions in emerging Extended Reality (XR) applications such as AI assistants and spatial object search. Deploying this capability on mobile XR devices requires a system that is open-vocabulary, real-time, and low-power. Existing approaches are compute-intensive and assume server-class resources. Cloud offloading offers a practical path, but no existing system splits semantic mapping across the device-cloud boundary or manages its communication, execution, and memory footprint. We present SemanticXR, the first device-cloud system for real-time, open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. Our key insight is to elevate semantically identifiable objects to first-class units of communication, execution, and memory across the device and server. On the server, object-level parallelism and geometry downsampling improve mapping latency, while object-level depth-mapping co-design reduces upstream bandwidth. On the device, an object-level sparse local map with incremental updates and update prioritization enables network-robust querying with bounded memory and downstream bandwidth. Object-level configurable resource usage vs. quality trade-offs let applications and the system adapt mapping to application requirements and operating conditions, respectively. Against a device-cloud baseline with the same perception models, object-level organization improves server-side mapping latency by 2.2X at equal semantic quality. Depth-mapping co-design maintains upstream bandwidth under 2.5 Mbps. On the device, SemanticXR sustains sub-100 ms query latency for up to 10,000 objects even under network drops, supports tens of thousands of objects within 500 MB, and scales downstream bandwidth with map changes, not total scene size. The system adds only 2% device power during normal operation.
Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids
Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that enables the policy to encourage exploration of fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with SOTA methods and ablation studies show that Stubborn achieved competitive performance, and the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains. For real-world demonstrations, please refer to https://aislab-sustech.github.io/Stubborn/.
An Attention-based Model for Robust Forecasting with Missing Modality
Learning with missing modalities is a fundamental challenge in multimodal robot learning, as real-world robotic systems often operate in environments with incomplete sensor data. Attention-based models are appealing for processing multimodal data because they can handle multiple modalities with a single backbone network. However, most multimodal models assume that all modalities are available during both training and inference, limiting their applicability in robotic perception and decision-making. In this paper, we introduce a multimodal model designed to handle missing modalities during both training and inference. The model is formulated as a conditional variational autoencoder (CVAE) and incorporates a transformer-based architecture that leverages attention mechanisms to learn a unified, fixed-dimensional representation, even when some modalities are missing. We show that our proposed model can be trained with missing modalities while approximating a robust representation of all modalities. We evaluate our approach on five multimodal datasets across two robot learning tasks: human trajectory prediction and robot manipulation forecasting. Experimental results demonstrate that our model effectively learns from incomplete data and is superior to prior multimodal fusion approaches.
comment: Work originally done in 2023
Learning Dynamic Swing-Up of an Inverted Pendulum using Remote Magnetic Actuation
Electromagnetic Navigation Systems (eMNS) have gained considerable attention for minimally invasive surgery and targeted drug delivery. While most of the literature relies on quasi-static control of these systems, recent work has demonstrated the benefits of dynamic approaches. However, trajectory tracking far from equilibrium states remains largely unaddressed. We close this gap by demonstrating the first swing-up of a magnetically actuated inverted pendulum using the clinically-ready Navion eMNS. Although the inverted pendulum is not clinically relevant in itself, the proposed method utilizes torques and forces as control objectives, making it applicable to other magnetically actuated devices such as catheters and guidewires. Our approach combines trajectory optimization that accounts for internal eMNS dynamics with time-varying Linear Quadratic Regulator (LQR) state feedback and Iterative Learning Control (ILC), which leverages previous trial data and the system's dynamic model to progressively refine the feedforward command. While LQR alone fails due to the complex phenomena of magnetic actuation, ILC enables successful swing-up within six iterations. Furthermore, post-experimental analysis reveals that the learned ILC correction closely matches the torque discrepancy predicted by high-fidelity magnetic field model calibration, suggesting learning and adaptation as a promising tool to deal with uncertainties in electromagnetic actuation arising, e.g., from patient-specific physiological motion patterns and field model calibration inaccuracies.
PhysVLA: Towards Physically-Grounded VLA for Embodied Robotic Manipulation
Vision-Language-Action (VLA) models excel at mapping visual inputs and natural language instructions directly to robotic control policies. However, because they are trained primarily to fit behavioural demonstration data, they do not explicitly enforce fundamental physical principles such as rigid-body dynamics or contact constraints. This exposes a critical physics gap: standard temporal smoothing applied on top of single-step or chunked VLAs trades trajectory quality for added failures that short-term memory cannot resolve. To bridge this gap, we introduce PhysVLA (Physics-VLA), a plug-and-play, inference-time framework designed to wrap any frozen VLA backbone without retraining, fine-tuning, or weight access, with less than 1 ms of overhead per control step. PhysVLA intercepts the predicted control action, captures only the simulator or system state, and applies a dual-layered correction: (i) a phase-aware finite-state machine that structures discrete task segments (approach, grasp, transport, and place), and (ii) a selective Euler-Lagrange gate that activates only when a dynamics oracle detects kinodynamic inconsistency. Evaluated across OpenVLA, OpenVLA-OFT, Force-VLA, and Generalist-VLA on LIBERO-Spatial with a 7-DoF Franka Panda, the framework delivers absolute success rate increases of up to 17% and stability increases of up to 19% with no per-task regressions, improves trajectory efficiency by up to 15% across all four backbones, and shows up to a 10x improvement in trajectory jerk robustness on a Robosuite Lift cross-simulator sweep. We further validate the framework on a real Agilex Piper arm with a pick-and-place task, confirming that PhysVLA transfers to physical hardware without retraining, with success-rate improvements of up to 50%, establishing physical awareness as a composable, backbone-agnostic runtime module.
comment: 9 pages, 5 figures, supplementary material included
Guided Diffusion with Distilled Vision-Language Reliability for Aerial Navigation
Autonomous UAV navigation is conventionally solved by pipelines that separate perception, mapping, and planning into distinct stages, which propagates errors, accumulates latency, and requires environment-specific retuning. End-to-end generative models remove these interfaces by mapping raw observations directly to trajectories, but inherit a subtle failure mode: trained on clean data, they cannot recognise when an observation is unreliable, and treat degraded regions such as glass, mirrors, and overexposed surfaces as valid evidence for planning. We present a reliability-aware diffusion planner for 3D UAV navigation. It conditions trajectory generation on the observation together with a scene-level reliability heatmap that marks where perception cannot be trusted, produced by a lightweight network that distils the open-vocabulary reasoning of a vision-language model within the real-time planning budget. To generalise to unseen environments without retraining, we steer the denoising process with a differentiable two-stage ESDF cost that treats physical obstacles from depth and virtual obstacles from highly unreliable regions on equal footing. In simulation and on a real quadrotor, our planner produces markedly safer trajectories than a state-of-the-art diffusion baseline, reducing the obstacle-violation rate from 40.3% to 9.6% and raising the mean reliability of traversed regions from 0.588 to 0.925. Ablating the reliability term alone drops mean reliability from 0.898 to 0.783, confirming it as the decisive component, while distillation runs the framework up to 2 times faster than the full vision-language model.
AnyGoal: Vision-Language Guided Multi-Agent Exploration for Training-Free Lifelong Navigation
End-to-end navigation policies trained on large simulation corpora degrade sharply when transferred to out-of-distribution scenes, categories, or goal modalities. Modular pipelines such as Modular GOAT are bottlenecked by closed-set object detection recall, while 3D snapshot-memory systems (e.g. 3D-Mem) accumulate dense, view-dependent representations that are heavy to maintain. We present AnyGoal, a training-free multi-robot architecture that places a Vision-Language Model (VLM) at the core of frontier-based exploration and coordinates agents through a shared 2D Gaussian Bayesian Value Map (BVM). The BVM maintains a per-pixel (mu, sigma^2) posterior over goal relevance, updated via precision-weighted fusion of VLM scores through a depth-cone mask, and is never reset between subtasks, yielding lifelong evidence accumulation. Frontiers are ranked by a convex blend of a VLM-as-judge softmax and a Bayesian UCB term on the BVM. A greedy allocator with spatial-separation penalty and commitment hysteresis distributes frontiers across agents without a centralized controller. On the full GOAT-Bench val unseen split (360 episodes, 2,669 subtasks), our dual-agent system achieves 52.4% Subtask SR at 12.7% SPL--state of the art under the strict physical regime (discrete 0.25 m steps, no teleportation, 42 deg HFOV) and a +27.5 pp improvement over Modular GOAT (24.9%). Single-agent AnyGoal achieves 41.9% Subtask SR, showing gains arise from the decision architecture. A four-way perception ablation shows that open-vocabulary detectors shift the dominant failure mode from exploration to goal verification.
comment: 17 pages, 3 figures
ContactWorld: What Matters in Vision-Tactile World Models for Contact-Rich Manipulation
Contact-rich manipulation requires world models to reason over complex contact dynamics from multimodal sensory observations. However, it remains unclear which representation properties fundamentally support stable long-horizon planning in contact-rich settings. In this paper, we present ContactWorld, a benchmark and systematic empirical study of vision-tactile world models spanning 12 contact-rich manipulation tasks, including insertion, disassembly, screwing, and exploratory interaction. Across extensive experiments, we find that representations that are both spatially structured and temporally continuous consistently achieve the strongest planning performance. In particular, point-cloud observations improve average planning success rates from 20.7% with wrist-view observations and 22.0% with front-view observations to 32.1%. We further find that the effectiveness of tactile sensing depends critically on cross-modal representation compatibility rather than modality scaling alone. Combining point-cloud observations with tactile force-field representations, which preserve richer spatial structure and interaction dynamics, further improves performance to 36.1%, yielding the strongest overall planning performance across all evaluated tasks. Moreover, tactile sensing becomes increasingly important under long-horizon planning objectives, where compounding prediction errors and contact uncertainty accumulate over time. Together, these findings highlight the importance of representation structure, multimodal compatibility, and long-horizon robustness in vision-tactile world models for contact-rich robotic manipulation.
comment: 32 pages, 12 figures, supplementary material included
Output-Level Regularization Eliminates the Seed Lottery in Single-GPU VLA Fine-Tuning
Fine-tuning a vision-language-action model (VLA-JEPA) on a single GPU should be simple: load a pretrained checkpoint, run training, deploy. There is a hidden danger. Run the same fine-tuning code thirteen times -- same data, same architecture, different random seed -- and twelve runs produce a robot succeeding 91--94% of the time, while one run silently degrades to 65.2%: a 29 pp gap with no error message, no warning, and no way to predict which seed will fail. We call this the seed lottery. We trace the cause to output collapse: the action predictor quietly learns to produce nearly identical outputs regardless of what the robot sees. Existing weight-level methods (L2, EWC) are structurally blind to this collapse -- they penalize weight changes, but collapse occurs in directions weights can move freely without affecting outputs, a gap we formalize via the Jacobian null-space. Across 7 methods x up to 13 seeds x 3 LIBERO benchmarks, three output-level regularizers -- VICReg (n=12 seeds), Dropout (n=4), and a halved learning rate (n=5) -- each eliminate every catastrophic seed (0/21 combined collapses vs. 1/13 Baseline; F(12,11)=28.7, p<0.001), while weight-level methods (L2, EWC) preserve the lottery. The simplest fix is changing one number in your optimizer config.
comment: 10 pages, 8 figures, submitted to CoRL 2026
Efficient Domain-Adaptive Policy Learning via Kernel Representation with Application to Quadrotor Control under Non-Stationary Disturbances
We present an algorithm for efficient domain-adaptive policy learning via kernel representations. Learning domain-adaptive policies is challenging since it requires an environment representation that is both sufficiently expressive to model complex sim-to-real gaps during offline training, and computationally efficient enough to support rapid online adaptation during deployment. For instance, a quadrotor may encounter time-varying, non-stationary disturbances, such as sudden gusts of wind, payload shifts, or transitions between distinct flight regimes with and without ground effects. To address these challenges, we model unknown disturbances using a differentiable kernel approximation based on random Fourier features. During the offline training phase, we randomly sample kernel coefficients and bandwidth parameters to generate a rich diversity of disturbance profiles. We then optimize the control policy via differentiable simulation with analytical gradients, a process that takes only 50 seconds of training time on an RTX 4090 GPU. During hardware deployment, the policy adapts to non-stationary environments in real time by updating both the kernel coefficients and bandwidth through online least-squares estimation. We evaluate our method on quadrotor trajectory tracking tasks across high-fidelity numerical simulations and hardware experiments using Crazyflie, subjected to various disturbances, including complex aerodynamic effects, wind, ground effects, and payload fluctuations.
Multi-Agent Embodied Autonomous Driving: From V2X Information Exchange to Shared World Models
Autonomous driving is shifting from isolated vehicle intelligence toward multi-agent embodied systems that share perception, infer intent, and coordinate action under uncertainty. This survey examines this transition through the lens of Shared World Models (SWMs): predictive cross-agent representations maintained across vehicles, infrastructure, and other traffic participants. We review more than 380 publications spanning vehicle-to-everything (V2X) communication, collaborative perception, inter-agent cognition, cooperative planning, end-to-end cooperative driving, and simulation and data engines for closed-loop validation. The organizing question is how exchanged observations become aligned state, intent-aware interaction, and coordinated downstream action. Across the surveyed literature, evaluation remains concentrated in simulation, curated benchmarks, and offline protocols. Foundation-model-based coordination also lacks verified real-time safety guarantees in open traffic. These gaps motivate key research priorities for multi-agent embodied autonomous driving (MAEAD): verifiable shared-state maintenance, robust intent and plan alignment, and safe coordinated action under communication, latency, and deployment constraints.
FlowMo-WM: A World Model with Object Momentum and Hidden Ambient Drift
World models in robot learning predict future states from visual observations and actions, enabling agents to reason about the consequences of their controls. However, many action-conditioned models are evaluated in settings where motion is dominated by immediate control, whereas aquatic surface vehicles and other real-world objects continue moving under inertia and are displaced by hidden ambient drift, such as water currents or wind. We propose FlowMo-WM, an end-to-end trainable visual world model that infers object-centric motion state and a predictive long-history context associated with hidden drift from image-action histories without direct supervision of flow fields. FlowMo-WM factorizes image-action history into a short-history latent state, trained to summarize object-centric motion, and a longer-history context, trained to summarize slowly varying exogenous influences. A zero-context residual transition separates action-conditioned base dynamics from context-dependent drift effects during latent rollout. In simulated aquatic surface-vehicle environments with diverse hidden flows, disturbances, and randomized vehicle dynamics, FlowMo-WM improves long-horizon rollout accuracy over representative action-conditioned latent world models. Prediction-time context ablations, in which the inferred context is zeroed or shuffled during rollout, show that the ambient context is important for stable prediction under hidden drift, while frozen linear probes characterize information encoded in the learned factors.
An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts
Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High fidelity onboard models and black box data driven approaches can recover accuracy across the flight envelope, but respectively impose computational burdens prohibitive for real time allocation and sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.
$μ_0$: A Scalable 3D Interaction-Trace World Model
World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $μ_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $μ_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $μ_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $μ_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $μ_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $π_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
Scalable Dynamic Tactile Sensing Enabled by Passive and Flexible Acoustic Waveguides
Artificial dynamic tactile sensing requires sensitivity, robustness, and compliance, yet existing technologies face trade-offs when scaling to large-area arrays, compounded by wiring complexity and cost. Here, we report a passive distributed paradigm using deep sub-wavelength acoustic waveguides that decouples performance from structural flexibility. Elastic-membrane-capped Helmholtz resonators interconnected by spring-reinforced microtubes form an enclosed network with invariant acoustic transmission under macroscopic bending. By sparsely embedding microphones, the system achieves real-time localization (4 mm highest spatial resolution; >99% accuracy in a 4 microphones 64-node sensing array) and waveform reconstruction of low-frequency signals (<100 Hz). Fast Continuous Wavelet Transform and a lightweight neural network enable inference within 5.5 ms. We demonstrate conformable prototypes-fingertip arrays, a tactile glove, and large-area skins-detecting stimuli from single-hair contact to 5-mg particle impacts, arterial pulse waves, feather touches, and finger contact. This establishes a scalable, flexible, low-cost paradigm for next-generation human-machine interfaces.
comment: 40 pages, 6 figures
Occupancy-Grounded Room Segmentation for Hierarchical 3D Scene Graphs
Hierarchical 3D scene graphs (3DSGs) for indoor robots organize geometric and semantic information across spatial scales, with a room layer that connects object-level perception to room-scale reasoning. Existing systems construct this layer from different spatial substrates (\eg{} place clusters, wall planes, or segmentation outputs), and as a result, room nodes are not evaluated on a common geometric criterion. We present an occupancy-grounded 3DSG pipeline in which room nodes are anchored to tracked free-space regions derived from occupancy decomposition, giving each room an explicit polygonal footprint. We evaluate the pipeline on 12 Matterport3D scenes by matching predicted room polygons to annotated room instances and compare against Hydra, a representative state-of-the-art place-connectivity baseline. The results show that occupancy-grounded anchoring recovers substantially more room instances than place-connectivity construction, at the cost of lower precision, and that wall-accurate room boundaries remain an open problem for both methods. Code is available at https://github.com/crcz25/OccuSG.
Adaptive-Horizon Conflict-Based Search for Closed-Loop Multi-Agent Path Finding
MAPF is a core coordination problem for large robot fleets in automated warehouses and logistics. Existing approaches are typically either open-loop planners, which generate fixed trajectories and struggle to handle disturbances, or closed-loop heuristics without reliable performance guarantees, limiting their use in safety-critical deployments. This paper presents ACCBS, a closed-loop algorithm built on a finite-horizon variant of CBS with a horizon-changing mechanism inspired by iterative deepening in MPC. ACCBS dynamically adjusts the planning horizon based on the available computational budget, and reuses a single constraint tree to enable seamless transitions between horizons. As a result, it produces high-quality feasible solutions quickly while being asymptotically optimal as the budget increases, exhibiting anytime behavior. Extensive case studies demonstrate that ACCBS combines flexibility to disturbances with strong performance guarantees, effectively bridging the gap between theoretical optimality and practical robustness for large-scale robot deployment.
QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/
Lyapunov-Based PI-Like Control for Robust Trajectory Tracking of a Four-Wheel Independently Driven and Steered Robot: Design and Experimental Validation
In this paper, a Lyapunov-based synthesis of a PI-like controller is proposed for robust trajectory tracking of an independently driven and steered four-wheel mobile robot. For the robot considered in this work, an explicit structurally verified mathematical model is used to enable systematic controller design with rigorous stability guarantees suitable for real time implementation. An augmented Lyapunov-based practical stability analysis is developed for the combined velocity-error and integral-error dynamics of the inner loop, yielding explicit bounds and sufficient conditions for practical stability and uniform ultimate boundedness of the combined velocity-error and integral-error state. The resulting control law retains a PI-like structure with model-based feedforward compensation, making it suitable for implementation on standard embedded platforms while improving robustness against configuration dependent residual dynamics and unmodelled effects. The effectiveness and robustness of the proposed design are demonstrated experimentally on a four-wheel independently steered and independently driven mobile robot platform, under both horizontal and vertical operating conditions and benchmarked against a PI controller and a sliding-mode controller.
comment: This work has been submitted to the IEEE for possible publication
RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning
Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold and progressively shifts optimization towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo demonstrates free-kick shot error 48.6% lower and shoot velocity 2.96x than prior work baselines. In real world on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in free-kick and moving-ball cases, accordingly. And the post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: https://opendrivelab.com/RoboNaldo.
From Digital to Physical: Digital Agents as Autonomous Coaches for Physical Intelligence
The field of Embodied AI is witnessing a rapid evolution toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor-intensive manual oversight from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs' success in software automation and science discovery, we introduce \textsc{EmboCoach-Bench}, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated RL and IL tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow, where agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human-engineered baselines by 26.5\% in average success rate; (2) agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities for pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in embodied AI field.
comment: 53 pages, 12 figures
Data-Driven Soft Robot Control via Adiabatic Spectral Submanifolds
The mechanical complexity of soft robots creates significant challenges for their model-based control. Specifically, linear data-driven models have struggled to control soft robots on complex, spatially extended paths that explore regions with significant nonlinear behavior. To account for these nonlinearities, we develop here a model-predictive control strategy based on the recent theory of adiabatic spectral submanifolds (aSSMs). This theory is applicable because the internal vibrations of heavily overdamped robots decay at a speed that is much faster than the desired speed of the robot along its intended path. In that case, low-dimensional attracting invariant manifolds (aSSMs) emanate from the path and carry the dominant dynamics of the robot. Aided by this recent theory, we devise an aSSM-based model-predictive control scheme purely from data. We demonstrate the effectiveness of our data-driven model in tracking dynamic trajectories across diverse tasks. We validate on high-fidelity, high-dimensional finite-element models of a soft trunk robot and Cosserat-rod-based elastic soft arms, with additional experiments confirming robust performance even in the presence of experimental noise. Notably, we find that five- or six-dimensional aSSM-reduced models outperform the tracking performance of other data-driven modeling methods by a factor up to 10 across all closed-loop control tasks.
comment: 41 pages, 24 figures, IJRR (2026) in press
Lexicographic Minimum-Violation Motion Planning using Signal Temporal Logic
Motion planning for autonomous vehicles often requires satisfying multiple conditionally conflicting specifications. In situations where not all specifications can be met simultaneously, minimum-violation motion planning maintains system operation by minimizing violations of specifications in accordance with their priorities. Signal temporal logic (STL) provides a formal language for rigorously defining these specifications and enables the quantitative evaluation of their violations. However, a total ordering of specifications yields a lexicographic optimization problem, which is typically computationally expensive to solve using standard methods. We address this problem by transforming the multi-objective lexicographic optimization problem into a single-objective scalar optimization problem using non-uniform quantization and bit-shifting. Specifically, we extend a deterministic model predictive path integral (MPPI) solver to efficiently solve optimization problems without quadratic input cost. Additionally, a novel predicate-robustness measure that combines spatial and temporal violations is introduced. Our results show that the proposed method offers an interpretable and scalable solution for lexicographic STL minimum-violation motion planning within a single-objective solver framework.
comment: Submitted to the IEEE Open Journal of Intelligent Transportation Systems (under review)
DiffCoord: Differentiable Coordination for Distributed Multi-Agent Trajectory Optimization
Integrating the Alternating Direction Method of Multipliers (ADMM) with Differential Dynamic Programming (DDP) provides a scalable framework for distributed multi-agent trajectory optimization. In practice, ADMM is typically truncated for computational efficiency, tightly coupling parameters that would otherwise separately govern coordination quality and task performance. In this paper, we propose Differentiable Coordination (DiffCoord), a unified framework that jointly meta-learns these coupled parameters for the truncated ADMM-DDP pipeline. These parameters are generated by agent-wise neural networks for task adaptation, and the same networks are shared among isomorphic agents to enable scalability to varying agent counts. We achieve efficient meta-learning by differentiating the ADMM-DDP pipeline end-to-end. Notably, this yields an auxiliary ADMM-LQR distributed gradient solver that computes and coordinates meta-gradients with respect to these parameters. This solver inherits the computational structure of the pipeline, enabling reuse of key computation results and efficient parallelization over agents and along trajectory horizons. We validate DiffCoord through numerical and physical experiments on a cooperative aerial transport system, where it reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces. It adapts robustly to varying team sizes and load dynamics, while reducing per-agent gradient computation time by up to 70% compared with state-of-the-art trajectory-gradient methods.
Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation
Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a visible navigable pixel to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE.Project Page: https://baobao0926.github.io/Goal2Pixel/.
comment: 8 pages
Extending the Law of Intersegmental Coordination: Implications for Powered Prosthetic Controls
Powered prostheses are capable of providing net positive work to amputees and have advanced in the past two decades. However, reducing amputee metabolic cost of walking remains an open problem. The Law of Intersegmental Coordination (ISC) has been observed across gaits and previously implicated in energy expenditure of walking, yet it has rarely been analyzed or applied within the context of lower-limb amputee gait. This law states that the elevation angles of the thigh, shank and foot over the gait cycle covary. In this work, we developed a method to analyze intersegmental coordination for lower-limb 3D kinematic data, to simplify ISC analysis. Moreover, inspired by motor control, biomechanics and robotics literature, we used our method to extend ISC to a new law of coordination of moments. We find these Elevation Space Moments (ESM), and present results showing a moment-based coordination for able bodied gait. We also analyzed ISC for amputee gait with powered and passive prostheses, and found that while elevation angles remained planar, the ESM lacked planar coordination. We present an ISC-driven powered prosthetic control framework, using healthy coordination as a constraint to predict the shank angles/moments to compensate for alterations due to a passive foot. We developed the ISC3d toolbox that is freely available online, which may be used to compute kinematic and kinetic ISC in 3D. This provides a means to further study the role of coordination in gait and may help address fundamental questions of the neural control of human movement.
comment: Submitted to 2026 IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob)
PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition
We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $σ_θ= σ_t / r$ in $\mathcal{O}(R{\cdot}S)$ time. The primary parameter $σ_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity that enhances cross-sensor generalization while reducing the need for extensive per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance relative to both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.
comment: 8 pages, 8 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L). \c{opyright} 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
Trojan Attacks on Neural Network Controllers for Robotic Systems
Neural network controllers are increasingly deployed in robotic systems for tasks such as trajectory tracking and pose stabilization. However, their reliance on potentially untrusted training pipelines or supply chains introduces significant security vulnerabilities. This paper investigates backdoor (Trojan) attacks against neural controllers, using a differential-drive mobile robot platform as a case study. In particular, assuming that the robot's tracking controller is implemented as a neural network, we design a lightweight, parallel Trojan network that can be embedded within the controller. This malicious module remains dormant during normal operation but, upon detecting a highly specific trigger condition defined by the robot's pose and goal parameters, compromises the primary controller's wheel velocity commands, resulting in undesired and potentially unsafe robot behaviours. We provide a proof-of-concept implementation of the proposed Trojan network, which is validated through simulation under two different attack scenarios. The results confirm the effectiveness of the proposed attack and demonstrate that neural network-based robotic control systems are subject to potentially critical security threats.
comment: Paper submitted to the 2026 IEEE Conference on Control Technology and Applications (CCTA)
AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly
Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. Recent methods based on vision-language models (VLMs) largely rely on coarse 2D perception and struggle to perform accurate reasoning over complex 3D geometry. To address this limitation, we propose AssemLM, a spatial multimodal large language model for robotic assembly that integrates assembly manuals, point clouds, and textual instructions to predict task-critical 6D assembly poses with explicit geometric understanding. To bridge raw 3D perception and high-level linguistic reasoning, AssemLM employs a specialized point cloud encoder to extract fine-grained geometric and rotational features for accurate 3D spatial reasoning in assembly tasks. In addition, we introduce AssemBench, a large-scale benchmark for assembly-oriented spatial reasoning with over 900K multimodal samples and precise 6D pose annotations, extending evaluation from 2D grounding to full 3D geometric inference. Extensive experiments and real-robot evaluations demonstrate that AssemLM achieves state-of-the-art 6D pose reasoning performance and effectively supports fine-grained, multi-step assembly tasks in real-world settings. Code, models, and the AssemBench dataset will be made publicly available.
comment: Project Page: https://assemlmhome.github.io/
UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data
Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory
Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot reconstruct the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of up to 44,000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3\% of the actuator's length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s2.
comment: Submitted to IEEE Transactions on Robotics, 20 pages, 14 figures
SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models ICML 2026
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed-insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory-requiring no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident-enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.
comment: ICML 2026 Spotlight. Project page: https://dcahn12.github.io/projects/scale/
EgoMoD: Predicting Global Maps of Dynamics from Local Egocentric Observations
Efficient navigation in dynamic environments requires anticipating how motion patterns evolve beyond the robot's immediate perceptual range, enabling preemptive rather than purely reactive planning in crowded scenes. Maps of Dynamics (MoDs) offer a structured representation of motion tendencies in space useful for long-term global planning, but constructing them traditionally requires global environment observations over extended periods of time. We introduce EgoMoD, the first approach that learns to predict future MoDs directly from short egocentric video clips collected during robot operation. Our method learns to infer environment-wide motion tendencies from local dynamic cues using a video- and pose-conditioned architecture trained with MoDs computed from external observations as privileged supervision, allowing local observations to serve as predictive signals of global motion structure. Thanks to this, we offer the capacity to forecast future motion dynamics over the whole environment rather than merely extend past patterns in the robot's field of view. As a site-specific dynamic prior, EgoMoD replaces the external global sensing infrastructure required by prior MoD methods at inference time with standard onboard sensors. Experiments in large simulated environments show that EgoMoD predicts future MoDs under limited observability, while evaluation with real images showcases its zero-shot transferability to real systems.
RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation
Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves real-world occluded manipulation success rates by $26.7$ percentage points over the strongest implicit visuo-tactile baseline, suggesting its improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
comment: 20 pages, 7 figures
Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning
Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27\% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page:Dex-Blind-Grasp.github.io.
comment: 23 pages, 6 figures
Miniature Testbed for Validating Multi-Agent Cooperative Autonomous Driving ICRA 2026
Cooperative autonomous driving, which extends vehicle autonomy by enabling real-time collaboration between vehicles and smart roadside infrastructure, remains a challenging yet essential problem. However, none of the existing testbeds employ smart infrastructure equipped with sensing, edge computing, and communication capabilities. To address this gap, we design and implement a 1:15-scale miniature testbed, CIVAT, for validating cooperative autonomous driving, consisting of a scaled urban map, autonomous vehicles with onboard sensors, and smart infrastructure. The proposed testbed integrates V2V and V2I communication with the publish-subscribe pattern through a shared Wi-Fi and ROS2 framework, enabling information exchange between vehicles and infrastructure to realize cooperative driving functionality. As a case study, we validate the system through infrastructure-based perception and intersection management experiments.
comment: Accepted by ICRA 2026, 8 pages
GLIDE: A Coordinated Aerial-Ground Framework for Search and Rescue in Unknown Environments
We present a cooperative aerial-ground search-and-rescue (SAR) framework that pairs two unmanned aerial vehicles (UAVs) with an unmanned ground vehicle (UGV) to achieve rapid victim localization and obstacle-aware navigation in unknown environments. We dub this framework Guided Long-horizon Integrated Drone Escort (GLIDE), highlighting the UGV's reliance on UAV guidance for long-horizon planning. In our framework, a goal-searching UAV executes real-time onboard victim detection and georeferencing to nominate goals for the ground platform, while a terrain-scouting UAV flies ahead of the UGV's planned route to provide mid-level traversability updates. The UGV fuses aerial cues with local sensing to perform time-efficient A* planning and continuous replanning as information arrives. Additionally, we present a hardware demonstration (using a GEM e6 golf cart as the UGV and two X500 UAVs) to evaluate end-to-end SAR mission performance and include simulation ablations to assess the planning stack in isolation from detection. Empirical results demonstrate that explicit role separation across UAVs, coupled with terrain scouting and guided planning, improves reach time and navigation safety in time-critical SAR missions.
WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning
Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-Based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
comment: 13 pages, 6 figures, 8th Annual Learning for Dynamics & Control Conference (L4DC)
GAE: Unleashing Physical Potential of VLM with Generalizable Action Expert
Vision-language models demonstrate strong reasoning and planning abilities, yet grounding these predictions into precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, leading to limited generalization. We propose Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints together with real-time point cloud observations to continuous action trajectories. GAE is pretrained on a large-scale pointcloud-trajectory dataset comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.
Safety Case Patterns for VLA-based driving systems: Insights from SimLingo
Vision-Language-Action (VLA)-based driving systems represent a significant paradigm shift in autonomous driving since, by combining traffic scene understanding, linguistic interpretation, and action generation, these systems enable more flexible, adaptive, and instruction-responsive driving behaviors. However, despite their growing adoption and potential to support socially responsible autonomous driving as well as understanding high-level human instructions, VLA-based driving systems may exhibit new types of hazardous behaviors. For instance, the integration of open-ended natural language inputs (e.g., user or navigation instructions) into the multimodal control loop may lead to unpredictable and unsafe behaviors that could endanger vehicle occupants and pedestrians. Hence, assuring the safety of these systems is crucial to help build trust in their operations. To support this, we propose a novel safety case design approach called RAISE. Our approach introduces novel patterns tailored to instruction-based driving systems such as VLA-based driving systems, an extension of Hazard Analysis and Risk Assessment (HARA) detailing safe scenarios and their outcomes, and a design technique to create the safety cases of VLA-based driving systems. A case study on SimLingo illustrates how our approach can be used to construct rigorous, evidence-based safety claims for this emerging class of autonomous driving systems.
Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video SC
Learning soft continuum robot (SCR) dynamics from video offers flexibility but existing methods lack interpretability or rely on prior assumptions. Model-based approaches require prior knowledge and manual design. We bridge this gap by introducing: (1) The Attention Broadcast Decoder (ABCD), a plug-and-play module for autoencoder-based latent dynamics learning that generates pixel-accurate attention maps localizing each latent dimension's contribution while filtering static backgrounds, enabling visual interpretability via spatially grounded latents and on-image overlays. (2) Visual Oscillator Networks (VONs), a 2D latent oscillator network coupled to ABCD attention maps for on-image visualization of learned masses, coupling stiffness, and forces, thereby enabling mechanical interpretability. We validate our approach on single- and double-segment SCRs, demonstrating that ABCD-based models significantly improve multi-step prediction accuracy with 5.8x error reduction for Koopman operators and 3.5x for oscillator networks on a two-segment robot. VONs autonomously discover a chain structure of oscillators. This fully data-driven approach yields compact, mechanically interpretable models with potential relevance for future control applications.
comment: Code available at: https://github.com/UThenrik/visual_oscillators_for_SCR Dataset available at: https://zenodo.org/records/17812071 Video available at: https://youtu.be/i80H8erVISM
DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems
Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed--accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.
Active Semantic Perception
We develop an approach for active semantic perception, which refers to using the semantics of the scene for tasks such as exploration. We build a compact, multi-layer scene graph that can represent large, complex indoor environments at various levels of abstraction, e.g., nodes corresponding to rooms, objects, walls, windows etc., as well as fine-grained details of their geometry. We develop a procedure based on large language models (LLMs) to sample new plausible scene graphs of unobserved regions that are consistent with partial observations of the scene. We develop a procedure to compute the information gain of a potential waypoint upon this scene graph to enable sophisticated spatial reasoning: for example, of the two doors that lead out of the living room, one probably leads to the kitchen and the other to the bedroom. We evaluate our approach in realistic 3D indoor apartments in simulation and also on a Unitree Go 2 robot in the real world. Qualitative and quantitative analysis shows that our approach can pin down high-level and low-level semantic information in the environment quickly and more accurately than existing approaches.
Learning Robot Safety from Sparse Human Feedback using Conformal Prediction
Ensuring robot safety can be challenging; user-defined constraints can miss edge cases, policies can become unsafe even when trained from safe data, and safety can be subjective. Thus, we learn about robot safety by showing policy trajectories to a human who flags unsafe behavior. From this binary feedback, we use the statistical method of conformal prediction to identify a region of states, potentially in learned latent space, guaranteed to contain a user-specified fraction of future policy errors. Our method is sample-efficient, as it builds on nearest neighbor classification and avoids withholding data as is common with conformal prediction. By alerting if the robot reaches the suspected unsafe region, we obtain a warning system that mimics the human's safety preferences with guaranteed miss rate. From video labeling, our system can detect when a quadcopter visuomotor policy will fail to steer through a designated gate. We present an approach for policy improvement by avoiding the suspected unsafe region. With it we improve a model predictive controller's safety, as shown in experimental testing with 30 quadcopter flights across 6 navigation tasks. Code and videos are provided.
Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis
Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $π_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.
comment: 13 pages, 9 figures,
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagate policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can fine-tune poorly-initialized behavior cloning policies to near full task-success with no expert data in the online replay buffer, and does so with few task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms methods alternatives on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.
Traceable Virtual Sea Trials in the Marine Robotics Unity Simulator for Manoeuvring Assessment of Unmanned Surface Vehicles
Accurate identification of hydrodynamic derivatives is essential for precise control and autonomous navigation of Unmanned Surface Vehicles (USVs). However, acquiring high-fidelity manoeuvring data from physical sea trials is often constrained by cost, safety, and environmental disturbances. Standard manoeuvring trials, particularly Turning Circle (TC) and Zig-Zag (ZZ), remain fundamental to IMO and ITTC assessment procedures because they provide comparable performance metrics reflective of underlying hydrodynamic behaviour. This paper extends the open-source Marine Robotics Unity Simulator (MARUS) by introducing a standardised Virtual Sea Trial framework for automated execution and data generation of TC/ZZ manoeuvres. The framework provides traceable command-actuation logging, system-identification (SI)-focused data conditioning, and automated extraction of IMO/ITTC-aligned manoeuvring metrics. A key contribution is a dedicated TC/ZZ data acquisition and post-processing pipeline, improving the repeatability and auditability of simulator-based manoeuvres while producing SI-ready datasets for hydrodynamic-derivative identification and digital-twin workflows. The framework also provides explicit command-execution separation for differential-thrust steering, where manoeuvre inputs are recorded as ordered rudder-equivalent commands and realised actuation is logged as an execution-level proxy derived from applied thrust. Case study results demonstrate repeatable and IMO-compliant manoeuvre behaviour. For TC tests, the normalised advance differs by approximately 3.9% between port and starboard turns, while the tactical diameter differs by 4.6-4.7%. For ZZ tests, first and second overshoot excesses remain below 1 degree for both +/-10-degree and +/-20-degree manoeuvres, satisfying IMO criteria, while peak yaw rates range from approximately 4.1 to 5.8 degrees/second.
A Pragmatic VLA Foundation Model
Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second with an 8-GPU training setup, representing a 1.5~2.8$\times$ (depending on the relied VLM base model) speedup over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
comment: Project Webpage: https://technology.robbyant.com/lingbot-vla/, Code: https://github.com/Robbyant/lingbot-vla/, GM-100: https://huggingface.co/datasets/robbyant/lingbot-GM-100
Multiagent Systems
Tuning Agent-Based Predator-Prey Models Toward Lotka-Volterra Dynamics
Recent growth in compute power has made it increasingly feasible to use large-scale agent-based models to simulate complex adaptive systems. A central difficulty is that such models contain many local rules and parameters, where small changes can lead to runaway behaviour, population collapse, or saturation at artificial bounds. We study this problem in a continuous predator-prey system where sheep and wolves are active agents with local sensing, internal energy, and recurrent neural network-based controllers. We ask whether environmental and demographic parameters can be tuned so that the resulting population dynamics resemble classical Lotka-Volterra cycles. We optimise these parameters with a feature-based loss that rewards sustained oscillations, phase lag, bounded populations, and long-term persistence, first for random controllers and then for evolved controllers in a more naturalistic setting. The model is implemented in ABMax, a JAX-based agent-based modelling framework that enables efficient batched simulation on hardware accelerators.
comment: 12 pages, 3 figures
Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks
Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata-theoretic machinery -- specification compilation, product game construction, attractor computation, and winning-region extraction -- is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict -- a formal certificate that a topology-specification pair is or is not defensible -- with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
comment: 26 pages, 7 figures, 7 tables. Under review at JAIR. Code: https://github.com/AchrafHsain7/Bastion
Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch ICML 2026
Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.
comment: Accepted at ICML 2026 Workshop on Reinforcement Learning from World Feedback (RLxF)
Reward Modeling for Multi-Agent Orchestration
Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.
comment: Preprint; work in progress
See What I See, Know What I Think: Dense Latent Communication Across Heterogeneous Agents
Multi-agent systems communicate mostly through text, paying a lossy and expensive decode and re-encode cost. KV-cache communication is a promising alternative, yet most prior work is homogeneous, using duplicate copies of the same model, and avoids the central challenge of cross-model latent alignment; existing heterogeneous methods are also restrictive, typically assuming shared input and using transferred caches mainly for steering. We study a more fundamental question: can heterogeneous agents be aligned well enough to perform real "mind reading" and transfer both what one agent sees and how it thinks? Our information-structure analysis reveals a duality: context-aware transfer is driven by sparse reasoning signals, while context-unaware transfer, where the receiver sees no input, requires dense contextual knowledge preservation. Motivated by this, we propose dense alignment for heterogeneous KV-cache communication via a lightweight cross-model cache transformation and two-phase training: reconstruction followed by generation. Across all six directions of {Qwen3-4B, 8B, 14B} and six in-domain and out-of-domain benchmarks, our method outperforms prior heterogeneous baselines, matches or exceeds text communication in context-aware settings at roughly 2 to 3 times lower compute, and remains effective in context-unaware transfer where prior methods collapse.
Multiagent Protocols with Aggregated Confidence Signals
Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.
comment: 22 pages and 5 figures, 9 pages and 2 figures before the appendix
Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda IJCAI
LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent's decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based monitoring: a structural foundation that prevents control-flow violations, while guardrails remain essential for catching semantic errors. We identify a structured set of neuro-symbolic research challenges on foundational and capability level and show that addressing them jointly enables compliance-by-construction. We call on the neuro-symbolic community to engage with regulated process automation as a high impact research domain.
comment: Accepted as a poster in NILA Workshop @ IJCAI-ECAI 2026
Can I Buy Your KV Cache?
Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.
LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis
Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user-driven sycophancy: the tendency of an LLM to reinforce a user-provided hypothesis instead of testing alternative explanations. This paper introduces LLM-as-an-Investigator, an evidence-first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three-agent evaluation pipeline in which a Problem-Solution Extractor Agent converts solved threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning-oriented LLMs, and the proposed investigator-based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning-only baselines, while its evidence-first protocol helps reduce user-induced conversational bias.
$α$-fair heterogeneous agent reinforcement learning
Cooperation in multi-agent systems is typically optimized through utilitarian objectives that maximize overall efficiency but fail to account for reward distribution, often resulting in inequitable "leader-follower" dynamics. While fairness-based approaches encourage pro-social behaviors where every agent benefits from cooperation, many current algorithms - including those utilizing reward shaping - break the stationarity of Markov Games or lack rigorous theoretical guarantees. This creates a critical gap between fair objective methods and theoretically safe learning frameworks. We propose a novel framework that bridges $α$-fairness with Heterogeneous-Agent Trust Region Learning (HATRL), ensuring monotonic improvement and convergence toward Nash Equilibria. Our approach leverages a fair advantage function that dynamically weights agent utilities based on their expected returns, allowing the global objective to transition from purely utilitarian efficiency to $α$-fairness welfare based on the parameter $α$. We introduce two practical algorithms, $α$-fair HATRPO and $α$-fair HAPPO, and demonstrate through experiments in sequential social dilemmas like CleanUp and CommonHarvest that they perform better than HATRL's algorithms from a utilitarian point of view while achieving socially higher outcomes.
Effects of Social Interactions in Self-Organising Railway Traffic Management
Recent research is exploring self-organised traffic management as a solution for scaling to complex real-world networks. In such a system, trains predict their neighbourhood, produce traffic plan hypotheses, and agree via consensus with neighbours on a future traffic plan to be implemented. This paper investigates a structural parameter within this pipeline: the predictive neighbourhood horizon. The horizon is used by trains to identify future potential conflicts with neighbours, and to establish the local interaction topology, that is, the subset of trains to negotiate with. As the primary design variable, the horizon directly determines the size and density of the social interaction graph, whereas its impact on the complexity of local sub-problems and the distributed consensus dynamics represents a trade-off to be explored. Through a closed-loop simulation framework the study evaluates how variations of the horizon impact the overall decentralised coordination process, from initial conflict detection to distributed schedule consensus. The analysis focuses on investigating the potential trade-off introduced by the horizon choice: balancing local tractability and computational responsiveness with the need for global schedule coherence and feasibility in safety-critical environments. Contrary to intuition, our empirical results indicate that the short time horizons suffice, while long values compromise local tractability and computational responsiveness with no gain in global schedule optimality.
The Illusion of Multi-Agent Advantage
Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale
The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems of reasoning, communication, and action. This paper develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber-physical environments. We synthesize foundations from single-agent agentic AI, multi-agent systems, distributed computing, communication networks, game theory, and security engineering to characterize the architectures and mechanisms required for scalable agent ecosystems. The paper examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource-management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. The resulting framework highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance for large-scale networks of autonomous agents.
When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation
LLM-based generative agents are increasingly used in urban simulators, yet it remains unclear whether they reproduce empirically realistic human mobility patterns or merely generate plausible mobility narratives. We introduce a validation framework for evaluating the mobility of generative agents of LLM-based urban simulators against real-world mobility data. For this, we use mobility laws, temporal rhythms, network motifs, semantic activity transitions, and behavioral mobility profiles. Using datasets from the Greater Paris region and Shanghai, we evaluate AgentSociety and CitySim across multiple dimensions of mobility realism. Our analysis reveals a substantial gap between narrative plausibility and empirical mobility realism. Although the simulators capture some high-level semantic activity distributions, they struggle to reproduce core spatial and temporal constraints, including realistic trip-length distributions, origin-destination flows, dwell times, and transition dynamics. We further observe that realistic mobility diversity is unstable across default prompting configurations and may require explicit profile-aware initialization. To support reproducible evaluation, we also contribute scalable and open LLM-driven infrastructure for regional-scale map generation, observability-enhanced simulation, mobility-metric computation, and traffic simulation. Our findings highlight the need for rigorous empirical validation of LLM-based urban simulators and provide practical tools for building more realistic and reproducible urban simulation systems.
comment: 14 pages, 10 figures
Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response
Autonomous network-security response systems promise to reduce Security Operations Centre (SOC) reaction latency, but reward-only multi-agent reinforcement learning (MARL) can improve security reward while remaining non-deployable. We present a safety-contract graph MARL framework and instantiate it as ACD$^3$-GAT (Adaptive Constrained Counterfactual Decisioning with a Graph Attention Network encoder), an architecture that separates simulator observations from reusable operational budgets, constrained optimization, graph state encoding, and counterfactual action screening. We evaluate the method in CAGE Challenge 4, where agents operate under budgets for Mean Time to Recover (MTTR), false-positive response, and firewall change-management disruption. Across the benchmark, every unconstrained method violates the SOC downtime budget in 100% of evaluated episodes, with mean downtime proxy costs of 311-430 against a budget of 50. This complements prior CAGE Challenge 4 findings by showing that reward-only learning lacks operational discipline. Constrained MAPPO-GAT (C-MAPPO-GAT) isolates Lagrangian operational-cost control and budget-aware screening, while ACD$^3$-GAT adds budget context, CVaR tail-risk estimation, opponent-belief state, and Graph Counterfactual Risk Propagation (G-CRP). The replicated comparison includes three 200-episode seeds for IPPO, MAPPO-GAT, C-MAPPO-GAT, and ACD$^3$-GAT. C-MAPPO-GAT reduces downtime violation from 100% to 0.3% and mean downtime cost from 355.4 to 15.5 relative to MAPPO-GAT. ACD$^3$-GAT reduces mean downtime cost to 48.2 with a 13.8% violation rate, placing it on the safety-contract frontier rather than at the most conservative compliance point. Topology-seed and coupled adaptive Red-process stress tests preserve this contrast and show lower worst adaptive degradation for safety-constrained policies than reward-only MAPPO-GAT.
Large Language Models as Supervised Extraction Assistants: Lowering the Barrier to Documentation Standard Adoption in Agent-Based Modelling
Agent-Based Modelling (ABM) relies on clear documentation to ensure credibility and transparency. Although standards exist for documenting models (e.g. ODD), processes (e.g. TRACE, EABSS), and data use (e.g. RAT-RS), their adoption remains limited due to the effort required to produce documentation that is often treated as supplementary. This paper explores the use of Large Language Models (LLMs) to facilitate and partially automate such processes. We conduct a feasibility study focusing on the underused Rigour and Transparency Reporting Standard (RAT-RS), using four LLMs to extract reports from a published ABM paper. We assess consistency and performance across question types, finding that LLMs generate coherent outputs and perform more reliably on descriptive than on explanatory or evaluative tasks. While LLMs can improve reporting quality and consistency, they also exhibit notable limitations. We identify practical heuristics for when LLM-assisted documentation is reliable and when human oversight is needed and call for systematic community-level exploration to enhance rigour and adoption in ABM reporting.
comment: 17 pages, accepted for publication at the Social Simulation Conference 2026, see https://www.durham.ac.uk/research/institutes-and-centres/research-methods/social-simulation-conference-2026/
TwinBI: An Agentic Digital Twin for Efficient Augmented Interactions with Business Intelligence Dashboards
Business intelligence (BI) increasingly combines dashboard interaction with LLM-based assistance, but these two modes often fall out of sync during multi-step analysis. As users switch between direct dashboard manipulation and natural-language queries, it becomes difficult to preserve a consistent analytical state across filters, hierarchies, metrics, and chart context. We present TwinBI, an agentic digital-twin framework that couples an LLM-based agent system with an executable BI dashboard state. TwinBI unifies conversational interaction, dashboard manipulation, semantic grounding, and provenance tracking through a shared analytical state reconstructed from a unified interaction log. It also exposes artifacts such as schema views, SQL, logs, and an /insights command for state-grounded analytical summaries. We evaluate TwinBI in two complementary ways. In a controlled A/B benchmark with the same backbone agent, TwinBI improves exact-match accuracy from 43.3% to 63.3%, partial-credit accuracy from 48.3% to 70.8%, and substantially reduces timeout rate from 40.0% to 10.0% relative to Dashboard alone. In a usability study, participants benefited from the integrated dashboard-and-chat workflow, with high task accuracy, moderate workload, and favorable ratings for state-aware interaction mechanisms. These results suggest that TwinBI improves both agent-level analytical reliability and user-facing analytical support by turning visible dashboard state into richer actionable context. Our dataset and source code are available at: https://github.com/simonjisu/TwinBI
YeasierAgent: Agentic Social Sandbox as a Canvas for Intent-Driven Creation of Platform-Agnostic Symbiotic Agent-Native Applications
This paper introduces YeasierAgent, an application-building paradigm based on symbiotic agents, narrative worlds, and scene-aware interaction. It challenges the conventional device-coupled model of software by redefining applications as collaborative spaces among users, agents, and worlds. We present a system architecture that achieves two primary contributions: (1) enabling the rapid, cross-platform construction of agent-native applications by utilizing platform-agnostic interactive units (agents, scenes, dialogue) rather than fixed graphical layouts; and (2) unifying the emotional companionship and practical tool execution attributes of intelligent agents within a single experiential sandbox. By integrating automated generation, user-created worlds, and spatial multi-agent collaboration, YeasierAgent formalizes the category of Symbiotic Agent-Native Applications, demonstrating a shift from isolated, tool-specific chatbots toward cohesive, socially embedded computational environments.
Human-like autonomy emerges from self-play and a pinch of human data
Self-play reinforcement learning has recently emerged as a way to train driving policies without any human data. It uses cheap, large-scale simulations to substitute expensive, large-scale human driving demonstrations. A key limitation of this approach is that policies trained through pure self-play can learn effective but alien driving conventions incompatible with people. Previous works attempt to mitigate such behavioral misalignments through extensive reward engineering and domain randomization, which are brittle and labor-intensive. Instead of completely discarding human demonstrations, our method treats them as a regularization objective on top of a minimal safe goal-reaching reward. Like the spice in a good stew, we find that a little human data goes a long way: our method uses only 30 minutes of human demonstrations, 2500x fewer than comparable imitation learning approaches. Resulting policies coordinate with held-out human trajectories and complete training in 15 hours on a single consumer-grade GPU. Videos and full source code are available at https://spiced-self-play.com/.
comment: 10 pages
Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems
As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy - the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics, etc.) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.
comment: This peer-reviewed paper is to appear in the Journal of Intelligent and Robotic Systems
Learning to Contest: Decentralized Robust Fairness in Cooperative MARL via Cross-Attention
Fair cooperative multi-agent reinforcement learning (MARL) teams that maximize an egalitarian welfare are exploitable: a single self-interested agent free-rides on the surplus that fair agents forgo to raise the worst-off, and the known remedy is a centralized need-based allocator. We show that a decentralized defense becomes possible once contention is graded: when a contested resource still delivers a fraction $1-c$, a worst-off cooperator that contests a free-rider strictly improves on yielding, so leverage exists for every $c < 1$. We introduce CAN, a permutation-equivariant cross-attention policy over agents' observed behaviour that infers how many free-riders are present and responds proportionally -- turn-taking when none, contesting just enough when some. Trained against an adversarial league, CAN keeps best-response exploitability near the centralized oracle ($ρ\approx 1.2\text{--}1.5$ vs. $ρ= N$ unprotected) at essentially no efficiency cost, whereas the fair-MARL learners (GGF, FEN, SOTO) each collapse to an exploitable or wasteful extreme. Giving those objectives CAN's identical adversarial training does not rescue them, so the objective -- not adversarial training alone -- is what makes hardening possible. Against a committed (non-adaptive) defector, every learned defense including ours provides deterrence rather than immunity, weakening as the leverage $(1-c)/2$ vanishes. Across further environments and team sizes the same principle sets the scope: robustness holds exactly as far as the game's contest leverage reaches, and we map that boundary rather than claim to remove it.
comment: 11 pages, 10 figures
DiffCoord: Differentiable Coordination for Distributed Multi-Agent Trajectory Optimization
Integrating the Alternating Direction Method of Multipliers (ADMM) with Differential Dynamic Programming (DDP) provides a scalable framework for distributed multi-agent trajectory optimization. In practice, ADMM is typically truncated for computational efficiency, tightly coupling parameters that would otherwise separately govern coordination quality and task performance. In this paper, we propose Differentiable Coordination (DiffCoord), a unified framework that jointly meta-learns these coupled parameters for the truncated ADMM-DDP pipeline. These parameters are generated by agent-wise neural networks for task adaptation, and the same networks are shared among isomorphic agents to enable scalability to varying agent counts. We achieve efficient meta-learning by differentiating the ADMM-DDP pipeline end-to-end. Notably, this yields an auxiliary ADMM-LQR distributed gradient solver that computes and coordinates meta-gradients with respect to these parameters. This solver inherits the computational structure of the pipeline, enabling reuse of key computation results and efficient parallelization over agents and along trajectory horizons. We validate DiffCoord through numerical and physical experiments on a cooperative aerial transport system, where it reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces. It adapts robustly to varying team sizes and load dynamics, while reducing per-agent gradient computation time by up to 70% compared with state-of-the-art trajectory-gradient methods.
LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis
Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.
The Replicator-Optimization Mechanism: A Scale-Relative Formalism for Persistence-Conditioned Dynamics with Application to Consent-Based Metaethics
This paper formalizes a widely used dynamical class--replicator-mutator dynamics and Price-style selection-and-transmission--and makes explicit the modeling choices (scale, atomic unit, interaction topology, transmission kernel) that determine how this class instantiates across domains. The backbone is known; we do not claim to have discovered selection. The novel contributions are threefold: (i) a scale-relative kernel parameterization where atomic units are themselves parameters, enabling systematic instantiation across physics, biology, economics, cognition, and social organization; (ii) a consent-friction instantiation for political philosophy, where friction is the primitive, legitimacy functions as survival probability, and belief-transfer functions as mutation kernel; and (iii) a derivation path from social contract theory rather than from biology or physics, arriving at the same formal structure via an independent route. We provide a bridge principle connecting descriptive dynamics to instrumental normativity: if agents prefer lower expected friction, then "ought" claims are shorthand for policies that reduce expected friction under the specified dynamics. This conditional structure avoids the is-ought fallacy while grounding normative discourse in empirically tractable dynamics. We address pathological cases (authoritarian stability, suppressed friction) through explicit modeling of latent versus observed friction. The framework generates testable predictions through operationalization of friction, legitimacy, and belief-transfer dynamics, and is falsifiable at the level of measurement apparatus rather than formal structure.
comment: 67 pages, 1 table, Lean 4 verification appendix (machine-checked). v2: substantially expanded from v1; adds formal-verification and identifiability sections and corrects references
Systems and Control (EESS)
Aerial Wildfire Suppression Planning with a Hybrid CNN-Cellular Automata Fire Model
Aerial wildfire suppression requires not only predicting fire spread, but also designing effective intervention strategies under operational and environmental uncertainty. We present a modeling and optimization framework for aerial wildfire suppression that combines a hybrid neural-cellular automaton wildfire model with gradient-based design of targeted aerial drops. The wildfire model predicts spatially varying spread behavior from terrain, fuel, and wind data, while the intervention module determines binary drop actions with continuous-valued location and orientation parameters mapped to the simulation grid. Water and retardant are represented with distinct suppression effects, corresponding to immediate reduction of active burning and persistent reduction of future spread. To evaluate the robustness of the resulting suppression plans, we quantify both aleatoric uncertainty through Monte Carlo sampling of daily fire-state realizations and epistemic uncertainty through spatially correlated prediction-error perturbations. A case study based on the 2020 Bear Fire shows that the framework can generate coherent aerial suppression schedules for reducing total fire-affected area and can support uncertainty-aware analysis of wildfire intervention strategies.
Distribution-Agnostic Robust Trajectory Optimization via Chance-Constrained Reinforcement Learning
This paper presents a distribution-agnostic robust trajectory-optimization framework based on chance-constrained reinforcement learning. The uncertainty is represented here through initial conditions and process noise, with the only requirement being that it can be sampled. A deterministic nominal trajectory is first computed offline, and reinforcement learning is then used only to robustify that baseline through a structured affine closed-loop correction law comprising a feedforward control adjustment and time-varying feedback gains. Probabilistic feasibility is enforced empirically through rollout-based upper-tail quantiles, while terminal dispersion is regulated through covariance-feasibility penalties. The framework is assessed on two materially different trajectory design problems. The flagship case study is a three-dimensional multi-impulse Earth-Mars transfer, where the learned policy is benchmarked against a recent robust trajectory-optimization reference under Gaussian uncertainty and then evaluated under bounded uniform uncertainty and under process disturbances not seen during training. The second case study is a stochastic atmospheric pinpoint rocket landing problem, used to assess portability to a short-horizon continuous-thrust setting with drag, mass depletion, and glide-slope constraints. The results show that the proposed framework can remain competitive in upper-tail fuel cost while preserving probabilistic feasibility, and that the same robustification scaffold can be carried across heterogeneous spacecraft trajectory planning problems without redesign of its core stochastic-control structure.
comment: Preprint. 39 pages, 16 figures
MCR-Bionic Hand: Anatomical Structural Priors for Dexterous Manipulation
Dexterous robotic hands are usually formulated as high dimensional active control systems governed by degrees of freedom, actuation, and algorithms. Human hand dexterity, however, is partly encoded in the physical architecture of bones, ligaments, tendons, aponeuroses, and intrinsic muscles. This work describes that contribution as two linked forms of structural intelligence: structural prior generation, in which wrist to finger tenodesis, FDS/FDP routing, and the dorsal extensor hood transform low dimensional posture inputs into default grasp configurations and PIP to DIP coordination; and muscle mediated modulation, in which extrinsic muscles, lumbricals, and interossei regulate MCP posture, distal stability, fingertip force paths, and contact states around that default state. Based on this framework, MCR-Bionic Hand is developed as a 1:1 musculoskeletal biomimetic hand integrating a two row eight bone wrist, cross wrist tendons, anatomical flexor routing, volar plate and collateral ligament constraints, the dorsal extensor hood, and intrinsic muscle pathways within one body. Functional demonstrations and geometric mechanical models show that wrist posture induces multi joint pre shaping, the extensor hood maps PIP posture to a coupled DIP response, and intrinsic plus pathways modulate distal stability and fingertip action direction after grasp formation. Contact rich tasks, including coin rotation, pen transfer, dorsal coin flipping, and cube manipulation, show that MCR-Bionic links low dimensional state generation with fine post contact modulation. These results suggest that anatomical biomimetics is valuable not for visual similarity, but for identifying human hand structures that perform part of control.
Differential Geometric Conditions for Koopman Linearizability of Control-Affine Systems
Koopman linearization opens many possibilities for control synthesis and analysis of nonlinear systems. Whether or not any given nonlinear control system admits a finite-dimensional Koopman representation remains a crucial question to address. A related problem is to categorize the class of all Koopman linearizable nonlinear control systems. In this work, we present differential geometric conditions on the drift and control vector fields of a control-affine nonlinear system, that must be necessarily satisfied for Koopman linear transformation to exist. The same conditions are also shown to be sufficient for (a slightly weaker notion of) Koopman linearizability on control-invariant manifolds. Further, these conditions, together with an additional condition, become necessary and sufficient for Koopman linearizability to a controllable linear system. Our examples illustrate the ease of checking these conditions, and also shed light on how Koopman linearizing transformation may not exist for a control-affine system even though one can linearize the autonomous part of the system via Koopman lifting.
comment: 9 pages, 4 figures
Spectrum Sharing Across Terrestrial and Non-Terrestrial Services in the FR3 Upper Midband SP
The frequency bands between 7 and 24 GHz, also known as upper midband or Frequency Range (FR) 3, are being considered as an enabler of 6th Generation (6G) mobile networks. This portion of the spectrum exhibits different propagation characteristics compared to frequencies above 24 GHz, while also offering the potential to provide larger bandwidth allocations for mobile systems than those available in the sub-6 GHz range. 6G technology and spectrum policy, however, will need to guarantee coexistence with the incumbents that already use these frequency bands, which include a variety of services, from radiolocation to satellite-based communications, remote sensing, and radioastronomy. In this paper, we consider the challenge of coexistence between 6G terrestrial systems and satellite incumbents in different portions of the FR3 bands. Using a large-scale 3D model of a terrestrial deployment in the city of Boston and an open-source ray tracing solution, we evaluate the level of Radio Frequency Interference (RFI) that tens of terrestrial Next Generation Node Bs (gNBs) generate toward satellites at different elevation angles. Our model, based on realistic obstruction, clutter, diffraction, and reflections, shows that sidelobes and Non-Line-of-Sight (NLoS) paths can significantly contribute to RFI. Besides directionality, the spatial distribution of gNBs also plays a key role in defining the RFI levels, suggesting that a careful design and operation of terrestrial deployments can create coexistence opportunities.
comment: 9 pages, 6 figures, 1 table. Presented 2025 IEEE DySPAN. Copyright 2025 IEEE. Please cite it as: P. Testolina, E. Beshaj, M. Polese and T. Melodia, "Spectrum Sharing Across Terrestrial and Non-Terrestrial Services in the FR3 Upper Midband," 2025 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), London, United Kingdom, 2025, pp. 1-9, doi: 10.1109/DySPAN64764.2025.11115947
Impedance MPC with Patient-Torque Estimation for Knee Rehabilitation Exoskeletons
Knee rehabilitation exoskeletons must enforce a prescribed joint trajectory while remaining safely compliant with involuntary spasm and voluntary patient effort-objectives in tension for any fixed-gain impedance controller. We present an Impedance Model Predictive Control framework for knee rehabilitation exoskeletons, demonstrated on a series-elastic-actuator (SEA) platform: an algebraic feedforward reduces the knee dynamics to a constant-coefficient scalar double integrator, and a receding-horizon quadratic program (QP) computes corrective torques while enforcing hard range-of-motion, torque, and velocity limits (ISO 13482). A Kalman disturbance state driven by direct SEA-based torque sensing (the series-elastic spring deflection measured through the elastic element - an intrinsic, EMG-free patient-torque estimate, not a separate load cell) gives a nominal offset-free guarantee and, via its sign and the desired-motion direction, sensorless Assist-as-Needed. The constant state matrix permits offline precomputation of the QP cost inverse, enabling 500 Hz operation with a multi-step horizon. Across seven-controller benchmarks (sinusoidal tracking, isometric hold), the 500 Hz Kalman MPC is offset free 0.1 mrad RMS, 0.1 mrad steady-state, 0.2 mrad peak under 15 Nm spasm, versus a 515 mrad steady-state offset for classical impedance at the same stiffness - the direct-measurement channel converging the estimate near-immediately (within a few sampling periods). Without the estimator it realizes a classical impedance (4.8 mrad RMS, 8.3 mrad steady-state). All MPC variants meet the 87 mrad clinical criterion; no classical controller does. The architecture is formulated for the 20 DOF MyoSuite myoLeg via coupling-aware per-joint QPs.
A Reactive Redistribution Mechanism for STL Tasks in Multi-Agent Systems Under Time-Varying Communication
We present a communication-aware task decomposition framework for multi-agent systems with collaborative relative configuration objectives specified in Signal Temporal Logic (STL), allowing for dynamic task reallocation under time-varying communication networks. Building on our prior work, the framework supports the direct use of existing feedback controllers for reactive task satisfaction. We address two key challenges: disjunctive STL specifications and time-varying communication networks. Disjunctive specifications are handled through a graph transition system that captures the alternative task sequences induced by logical OR operators. To address time-varying connectivity, we introduce a redistribution mechanism that transfers tasks from disconnected agents to connected ones as the network evolves while preserving decentralized execution. Simulations and experiments on a swarm of Crazyflie drones demonstrate scalability in the number of agents, communication connectivity, and specification complexity.
Embodied Opinion Dynamics for Safety-Critical Motion Control in Dynamic Environments
This paper proposes a novel adaptive control framework that embeds nonlinear opinion dynamics within the dynamical sensorimotor layers of an automated vehicle governed by second-order nonholonomic bicycle kinematics. The framework enables an ego vehicle to perform adaptive decision-making and achieve safe motion control under interaction uncertainty with non-cooperative neighboring agents. We consider a representative case study in which an ego vehicle autonomously attempts to merge into a lane occupied by human-driven or automated vehicles whose intentions are unknown. Within the proposed framework, the ego vehicle adaptively selects and executes merging versus non-merging behaviors in response to changing environmental conditions. Formal safety guarantees, as well as equilibrium and stability analyses of the closed-loop system, are provided. Numerical simulations further demonstrate the effectiveness of the proposed approach.
When expectation fails: stochastic MPC of linear systems with random input losses
We consider stochastic model predictive control (MPC) for constrained linear systems subject to multiplicative binary input uncertainty, motivated by applications such as networked control with packet losses and intermittent actuation. A common approach in this setting replaces the stochastic dynamics with their expectation, yielding tractable formulations that admit standard terminal ingredients and stability guarantees in expectation. We show that such formulations can exhibit structural properties that differ fundamentally from those of deterministic MPC and may be misleading as indicators of realized closed-loop behaviour. In particular, the expected value function is not necessarily monotonic in the prediction horizon, and value function-based inner approximations of the region of attraction may deteriorate as the horizon increases. Furthermore, we establish a probabilistic comparison with certainty-equivalent (optimistic) MPC, showing that the latter can ensure a strictly positive probability of recursive feasibility in situations where stochastic MPC certifies feasibility but fails with probability one. These results highlight inherent limitations of expectation-based stochastic MPC for systems with multiplicative binary uncertainty and motivate a re-examination of how stochasticity is incorporated into constrained predictive control design for such systems.
comment: This work has been submitted to the IEEE for possible publication
Sizing of a grid-forming power converter to improve the small-signal stability of an LCC-HVDC system connected to a weak grid
Line-commutated converter high-voltage direct current (LCC-HVDC) has proven to be a reliable technology for bulk power transmission over long distances. However, the growing penetration of converter interfaced generation (CIG) is resulting in weaker AC grids, rendering the operation of LCC-HVDC systems vulnerable and posing a serious challenge to their stability. Grid-forming (GFM) controlled voltage source converter (VSC) have been shown to provide stabilizing impact in weak grid conditions. However, the impact of GFM controlled VSCs (GFM-VSC) on stability of LCC-HVDC in weak grid conditions has not been studied in depth in the literature. In this paper, a simplified model of LCC-HVDC is proposed and validated. Then a small-signal state-space model of a system consisting of aforementioned LCC-HVDC, a GFM-VSC and an infinite grid is developed to study the interactions between different components. The small-signal stability analysis shows the stabilizing effect of the GFM-VSC on the stability of the LCC-HVDC link in weak grid condition. Furthermore, the study on the sizing of the GFM power converter reveals that even a modest share of the capacity of the GFM power converter relative to the total nominal apparent power (sum of nominal power of LCC-HVDC and the nominal apparent power of GFM-VSC) is sufficient to ensure the stability of the system, in the test system analyzed in this study. This work just focuses in small-signal stability, but it is important to highlight that other stability phenomena should also be taken into account when selecting the final size of the GFM-VSC.
Multi-Phase Optimization of Shared Charging Infrastructure for Freight Electrification
The transition to heavy-duty battery electric vehicles requires an efficient and cost-effective deployment of the charging infrastructure, particularly when multiple operators share resources. This paper presents a multi-phase optimization framework for the joint planning of charging stations in a shared network, using high-resolution empirical truck trajectory data from two freight companies with distinct operational characteristics. The model is formulated to minimize the total number of charging stations while ensuring that the predefined electrification targets are met over successive expansion stages. The analysis captures heterogeneity in fleet usage, with one company operating a spatially concentrated network with shorter and more consistent routes, and the other exhibiting more dispersed operations with longer and more variable driving patterns. The results show that early-stage infrastructure deployment primarily supports fleets with concentrated operations, while later expansion phases are essential to accommodate long-haul and geographically dispersed transport demand. Furthermore, shared infrastructure not only enables reductions in redundant investments, but also introduces dependencies where certain fleets rely heavily on the full network to sustain electrified operations. In general, the findings highlight the importance of coordinated and data-driven infrastructure planning, and demonstrate that fleet-specific characteristics strongly influence both infrastructure requirements and electrification outcomes. The proposed framework provides practical insights on how collaborative and phased deployment strategies can enhance the scalability and efficiency of freight transport electrification.
comment: This work has been submitted to the IEEE for possible publication
Mitigating business risks from renewable PPA power sourcing uncertainties for European green hydrogen production: Robust system design, regulatory adjustments and offtake flexibility
As energy prices surge for the second time in recent years driven by the ongoing crisis in the Middle East, the European Union's continuing reliance on fossil energy imports is becoming increasingly apparent. However, despite offering an intriguing prospect of improved energy resilience, the ramp-up of local green hydrogen production lags far behind the officially stated ambitions set after the 2022 energy crisis. A prominent reason for the widening implementation gap between announced and realised production projects is overly strict rules on renewable power sourcing, prompting Member states' ministries and the European Commission to propose advancing a planned rules review from 2028 to 2026. To contribute to a successful review and rule adjustments, we address an important gap in understanding the effects of power purchase rules on green hydrogen production. By taking the perspective of European electrolyser operators, we show how the criterion of additionality and its interaction with required temporal correlation can jeopardise the fulfilment of green hydrogen offtake agreements and affect green hydrogen production costs across different European bidding zones. Applying different design paradigms to a green hydrogen production system reveals that electrolyser operator measures, such as PPA and storage upsizing, can help to mitigate the business risks posed by the additionality criterion but come with increased costs. Alternatively, relaxed temporal correlation and increased offtake flexibility both increase production system robustness and reduce production costs simultaneously. Whereby relaxing temporal correlation rules does not result in exceeded emission intensity thresholds, underlining the potential of extended transitional rules to support the ramp-up of European green hydrogen production.
MPC for underactuated spacecraft control with a Lyapunov supervised physics-informed neural network correction layer SP
Underactuated spacecraft faces controllability limitations and heightened sensitivity to environmental disturbances, complicating attitude maneuvering and stabilization. Due to the lack of control authority along the underactuated axis, conventional controllers cannot directly stabilize all attitude components and therefore require reference planning strategies. Furthermore, MPC approaches remain sensitive to inertia uncertainty and unmodeled dynamic couplings, resulting in degraded tracking performance under mismatch. To address these issues, we consider a hierarchical architecture integrating three layers: (i) a nonlinear model predictive controller (NMPC) for constraint and underactuation-aware maneuver planning and nominal closed-loop stability under actuator limits; (ii) a physics-informed neural network (PINN) trained offline on simulation data to estimate residual disturbance torques, with loss terms that enforce consistency with rigid-body rotational dynamics; (iii) a Lyapunov-based supervisory safety mechanism that evaluates the learned correction online and bounds or suppresses its influence to preserve the stability properties of the baseline controller. The architecture is evaluated in a high-fidelity simulation environment modelling reaction wheel dynamics, actuator saturation, and environmental disturbances. Monte Carlo studies show statistically significant reductions in steady-state attitude error relative to standalone NMPC while maintaining robust behavior under uncertainty. The supervisory layer ensures graceful degradation to purely model-based control when the learning-based augmentation is unreliable.
comment: Accepted at SPAICE (AI in and for Space) 2026
Characterization and Computation of Feedback Nash Equilibria in Scalar Discounted N-Player Linear Quadratic Games
This paper studies feedback Nash equilibria (FNE) in scalar discounted linear quadratic (LQ) games with $N$ players. By explicitly incorporating the discount factor, we show that finite-cost equilibria may fail to stabilize the original system, motivating a distinction between FNE and stable FNE together with a sufficient stability condition. Based on a parametric characterization of the policies, we propose numerical methods for computing all equilibria. Particular attention is devoted to the symmetric game, where a closed-form expression of the symmetric FNE and conditions for the existence of up to $M\leq2^N-2$ equilibria are derived. Numerical experiments illustrate how equilibrium multiplicity depends on the game configuration and highlight the emergence of finite-cost non-stabilizing equilibria.
comment: Accepted for publication in IEEE Control Systems Letters (LCSS), 2026
Proportional power dispatch and fairness in wind farm power tracking
Controlling the power output of a wind farm in order to track a target signal can be useful for the power grid frequency regulation. It can be achieved by dividing the target into individual setpoints, then followed by each turbines' controller. In this article, we are interested in finding power allocations that fairly spread the power reserves (i.e. unused fraction of available powers) among turbines, helping with robustness to uncertainties and changing wind conditions. In particular, we study the fairness properties of proportional dispatch, which is the most common power dispatching method. We show that due to the wake effects in a wind farm, proportional dispatch has to be applied iteratively to achieve fair distribution of power reserves. We study the convergence of this iterative process (referred to as IPD) to equalized reserves, and then illustrate it on simulated experiments, using steady-state and dynamic simulators. Numerical results show that IPD closely approaches max-min fairness, a related fairness objective, for a cheap computational price compared to black-box optimization. Finally, IPD is also shown to reduce the complexity of the problem of fair power dispatch combined with yaw wake steering optimization.
Trajectory-Level Redirection Attacks on Vision-Language-Action Models
Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/
Data-Driven Frequency-Selective Output Regulation of Nonlinear Systems under Almost Periodic Exosignals
This paper studies output regulation for a class of unknown continuous-time nonlinear systems driven by almost periodic exosignals. The plant dynamics are assumed to be linearly parameterized over a prescribed nonlinear dictionary, while all coefficient matrices in the plant, input channel, output map, and exosignal channel are unknown. Since the plant model is unavailable, exact nonlinear output regulation would generally require model identification followed by the solution of nonlinear regulator equations. To avoid these steps, we pursue a frequency-selective regulation objective: the steady-state regulation error is allowed to be almost periodic, but its Fourier-Bohr coefficients at prescribed exosystem frequencies are guaranteed to vanish, and the residual error energy is explicitly bounded. To this end, a p-copy internal model is embedded in a dynamic controller, yielding an augmented nonlinear system whose unknown constant matrices are represented directly by measured data. A noise-robust semidefinite program is derived to synthesize the controller gain without model identification and without measuring the exosignal amplitudes or phases. The resulting closed-loop vector field is made exponentially contractive on a prescribed operating set, which implies the existence and uniqueness of a bounded and attracting trajectory. By combining contraction theory with Fourier-Bohr analysis, we prove that this steady-state trajectory is almost periodic, that the embedded-frequency components of the regulation error are eliminated, and that the unmodeled spectral components satisfy a Parseval-type time-averaged energy bound. Numerical and physics-based simulations on a quadrotor with a cable-suspended payload illustrate the effectiveness of the proposed data-driven internal-model design.
Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning
For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.
comment: Project website: https://allisonandreyev.github.io/grasp.github.io/
Computing Headway Bounds under Worst-Case Bunching in Fixed-Line Transit Systems
Vehicle bunching is a major problem for transit operators. When vehicles bunch together, the lead vehicle will service the majority of passenger demand, leaving the following vehicles to operate below capacity, wasting fuel and money. Furthermore, after the last vehicle in the bunch passes, the time before the next vehicle's arrival (headway) will be large. Transit operators can combat bunching by holding buses at stops along a route, trading riding time for even headway times. While prior work has focused on developing holding policies to minimize average case bunching, no work has focused on analyzing the longest and shortest possible headway times under a broad group of such policies. We assume that dwell times at stops and travel times between stops are bounded and develop a dynamic program that computes the maximum and minimum headway times for a single bus route with an arbitrary number of control points, vehicles, and holding policies. These bounds are tight in the sense that it is always possible to identify the specific sequence of events that lead to their occurrence. We use these bounds to investigate the effects of different holding policies, stop placement, and number of vehicles on route headways and worst-case bunching. Finally, we apply these analysis techniques to a real-world transit system in Nashville, TN and show their utility for transit planning.
comment: 11 pages, 9 figures, to be presented at the 2026 IEEE 32nd International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)
Homotopy-Based Re-Initialization for Switched DAEs in Power System Transient Simulation
The simultaneous solution of switched differential-algebraic equations (DAEs) in power system transient simulation may suffer convergence loss following discontinuous events. This difficulty is typically interpreted as a poor post-event initialization problem. This letter presents a geometric framework that explains the underlying convergence mechanism and clarifies why standard convergence-restoration methods may fail at discontinuities. Based on this interpretation, a homotopy-continuation based globalized re-initialization scheme is developed to restore convergence. The proposed method is validated through numerical simulations of representative discontinuities in power system transient simulation. Results show that in the cases where direct post-event solution fails, the proposed scheme can reliably recover convergence.
comment: Manuscript submitted to IEEE Power and Energy Society Letters and is currently under revision
Pushing the Frontiers for Floating Solar Photovoltaics -- The Case for South America
Floating solar photovoltaic (FSPV) systems provide a land-efficient pathway to expand clean electricity access in energy-poor regions. South America has among the highest global FSPV potential (approx 38.26 TWh per million acres of water surface), yet deployment remains limited. This study presents a techno-socio-economic framework to assess FSPV for energy access, water security, and grid flexibility, with case studies in Nicaragua, Honduras, and Guyana. Estimated yields for 50 to 398 MW systems exceed 1,500 to 2,000 kWh per kW annually with capacity factors above 20 percent. At El Cajon, FSPV could significantly reduce emissions relative to fossil generation. Results show competitive costs with land-based PV when accounting for avoided land use, shared hydropower infrastructure, and water benefits. The framework also highlights co-location with hydropower and AI data centers, offering a scalable model for deployment in underserved regions.
comment: 63 pages, 20 tables, 18 figures
The GIST 2064-Bus Test System: A Public-Data Synthetic Model of the Korean Power Grid
No model of the Korean transmission system at native resolution is publicly available, which makes reproducible research on one of the world's most distinctive grids difficult-an islanded interconnection with extreme separation between generation and the Seoul Metropolitan Area load center, low renewable penetration, and heavy reliance on extra-high-voltage (EHV) transmission. Working strictly from public data, and for research purposes only, we present the GIST 2064-bus test system, a geographically grounded synthetic model of the Korean grid. Unlike fully synthetic cases, whose lines match no real corridor, and aggregated public Korean models, it derives its 345 and 154 kV layout from the OpenStreetMap/OpenInfraMap power layer by a multi-source shortest-path reassembly of overhead-line geometry, gap-fills unreachable substations with a geographic minimum-spanning-tree backbone, and calibrates the aggregate circuit length to published national statistics (108/107/97% at 765/345/154 kV). The model spans 2064 buses, 512 generation and renewable sources (144 GW), 3044 AC line circuits plus high-voltage direct-current (HVDC) equivalents, 3073 transformers, and reactive resources (shunts and 11 FACTS devices), serialized to a PSS/E-compatible CSV schema. A general-purpose pandapower Newton-Raphson solver-with generator reactive limit enforcement, a secant-gain remote voltage-control loop, tap-changer and switched-shunt fixed-point control, and zero-impedance regularization-solves an 85 GW high demand snapshot to a single connected, converged operating point (mean voltage 0.996 pu, 2.3 % losses, no undervoltage buses), structurally consistent with the independent public KPG-193 model. The dataset, maps, and tooling are released as a citable platform for power flow, planning, and decarbonization studies.
comment: 10 pages, 5 figures, 5 tables
To Share or Not to Share: Orchestrating Trustworthy Data in Global Value Chains
As the EU Carbon Border Adjustment Mechanism (CBAM) approaches, the global semiconductor value chain faces growing structural tensions between regulatory transparency and data sovereignty. This article proposes a RegTech reference architecture using the International Data Spaces (IDSA) framework to orchestrate trustworthy environmental telemetry across the semiconductor-petrochemical nexus. The framework distinguishes the mandatory CBAM requirements from voluntary Science Based Targets initiative (SBTi) frameworks, while addressing the additive complexities of the Safe-and-Sustainable-by-Design (SSbD) framework. Moving beyond standard linear technology stacks, we introduce a prospective roadmapping methodology that transforms upstream physical vulnerabilities into circular, negative feedback loops. Focusing on the Taipei and Penang technology corridor, the article details how sovereign data exchange enables Digital Product Passports (DPPs) to drive Global Business Services (GBSs) capability demands. Finally, we discuss the integration of Agentic AI for autonomous compliance and FinTech green financing, providing a scalable blueprint for global industrial clusters to achieve sovereign, sustainable, and transparent value chains.
Orchestrating the Twin Transition in Multinational Corporations: Technology Roadmapping for Green and Digital Global Business Services
Global Business Services (GBS) have emerged as a "living laboratory" for the Twin Transition of Green and Digital Transformation, as multinational corporations (MNCs) face increasing pressure to harmonize digital efficiency with environmental stewardship. Aiming to derive a socio-technical framework, this paper synthesizes Technology Roadmapping (TRM) with the International Telecommunication Union (ITU) ICT-centric innovation ecosystem toolkit. A bibliometric analysis of research clusters reveals an evolutionary shift from basic process automation toward "Sustainable Intelligence," identifying the GBS unit as a central "operational airlock" that mediates between landscape pressures -- such as the EU's dual mandate and Carbon Border Adjustment Mechanisms -- and niche innovations in AI-native workflows. The study further maps these clusters onto a stakeholder engagement canvas, highlighting how resilient "Middle Power" hubs in Poland, Portugal, and Malaysia are bypassing the middle-income trap to provide a "third way" for global value chains amidst a bifurcated geopolitical cloud. The results offer a data-driven design approach for leaders and entrepreneurial support networks to orchestrate talent and supply chain flows, thereby enriching the conceptual understanding of Industry 5.0 and the role of GBS as a primary mechanism for navigating a volatile, multipolar digital economy.
comment: 9 pages, 6 figures
Agentic MPC for Semantic Control System Resynthesis
While MPC effectively handles structured, diverse, and low-level specifications, it lacks the capability to dynamically incorporate high-level contextual information such as social norms, user intent, or natural language instructions. To address this limitation, this manuscript introduces an agentic MPC framework that enables context-aware, semantically adaptive control synthesis by integrating with large language model-based agents. The agent interprets heterogeneous inputs, including natural language messages, environmental observations, and external knowledge, to resynthesize the control specifications. The effectiveness of the framework is demonstrated in an autonomous driving scenario, where the system aligns with personal preferences or responds to social situations such as emergency vehicle yielding.
comment: 7 pages, 5 figures
Patching Control Lyapunov Barrier Functions for Temporal Logic Specifications with Bounded Controls
We propose an abstraction-free framework for controller synthesis for continuous-time dynamical systems subject to Linear Temporal Logic (LTL) specifications and bounded control inputs. The proposed method combines the sequential decomposition of LTL tasks with the use of formally certified Control Lyapunov-Barrier Functions (CLBFs). By formulating local specifications as a sequence of safe-stabilization problems, we systematically approximate and patch the winning sets of the decomposed subtasks. The satisfaction of these local constraints is guaranteed by the offline-computed level sets of the CLBFs. As a result, our framework yields formally verified switching feedback controllers that enable efficient online planning and dynamic re-planning. This ensures robust continuous specification satisfaction in the presence of state perturbations, avoiding the explicit state-space abstractions commonly required in the literature. The approach is validated through numerical simulations and a hardware demonstration on a Crazyflie quadrotor.
Learning Dynamic Swing-Up of an Inverted Pendulum using Remote Magnetic Actuation
Electromagnetic Navigation Systems (eMNS) have gained considerable attention for minimally invasive surgery and targeted drug delivery. While most of the literature relies on quasi-static control of these systems, recent work has demonstrated the benefits of dynamic approaches. However, trajectory tracking far from equilibrium states remains largely unaddressed. We close this gap by demonstrating the first swing-up of a magnetically actuated inverted pendulum using the clinically-ready Navion eMNS. Although the inverted pendulum is not clinically relevant in itself, the proposed method utilizes torques and forces as control objectives, making it applicable to other magnetically actuated devices such as catheters and guidewires. Our approach combines trajectory optimization that accounts for internal eMNS dynamics with time-varying Linear Quadratic Regulator (LQR) state feedback and Iterative Learning Control (ILC), which leverages previous trial data and the system's dynamic model to progressively refine the feedforward command. While LQR alone fails due to the complex phenomena of magnetic actuation, ILC enables successful swing-up within six iterations. Furthermore, post-experimental analysis reveals that the learned ILC correction closely matches the torque discrepancy predicted by high-fidelity magnetic field model calibration, suggesting learning and adaptation as a promising tool to deal with uncertainties in electromagnetic actuation arising, e.g., from patient-specific physiological motion patterns and field model calibration inaccuracies.
TetraRL: A Self-Adaptive Runtime for On-Device Deep Reinforcement Learning Systems
Autonomous robotic systems, including autonomous vehicles, drones, and mobile robots, increasingly rely on on-device Deep Reinforcement Learning (DRL) to adapt to dynamic environments. Unlike cloud-based solutions, embedded DRL must perform training and inference directly on resource-constrained hardware while maintaining timely decision-making. This creates a fundamental challenge: balancing four tightly coupled objectives, real-time performance, task reward, memory utilization, and energy consumption. Optimizing these objectives independently often leads to suboptimal behavior, while conventional multi-objective methods may violate resource constraints and compromise reliability. This paper presents TetraRL, a self-adaptive runtime framework for tetra-objective on-device DRL. TetraRL formulates embedded DRL as a unified optimization problem over real-time, reward, RAM, and reserve (energy) objectives, and employs a preference-conditioned reinforcement learning controller to dynamically navigate the resulting trade-off space. The framework integrates a unified resource-management abstraction, hardware-aware DVFS control, and a runtime Override Layer for robust constraint enforcement. We implement TetraRL on NVIDIA Jetson AGX Orin and Orin Nano platforms and evaluate it across diverse DRL environments. Results show that TetraRL effectively balances all four objectives, achieves competitive trade-offs under varying runtime preferences, and incurs negligible overhead. Moreover, a single trained policy can support runtime-switchable optimization goals, providing a practical foundation for resource-aware and self-adaptive on-device DRL.
comment: Extension version of RTSS'23 and RTSS'24
Spatial Load Correlation in AI Data-Center-Dominated Power Systems
The proliferation of large-scale data centers introduces spatially correlated demand profiles that challenge the long-standing assumption of statistical independence of loads in power system analysis. This paper examines the emergence of such load correlations and evaluates their impact on data-center-dominated grids. Analytical derivations reveal that correlated load fluctuations amplify aggregate stochastic disturbances, reduce voltage stability margins through weakened reactive power stiffness, and degrade frequency stability margin by erosion of natural load diversity effects. Real-time digital simulation studies confirm that moderate spatial correlation in distributed data centers produces simultaneous frequency deviations and voltage fluctuations across multiple buses. The findings offer transmission system operators a physics-based perspective to interpret emerging oscillatory phenomena and establish stability planning criteria grounded in measurable load-correlation structures rather than traditional diversity assumptions.
comment: To appear in proceedings of 2026 IEEE Power & Energy Society General Meeting, 19-23 July 2026, Montréal, Canada
Modal Analysis of Spatial Load Correlation in AI Data Center-Dominated Power Systems
Hyperscale AI data centers induce spatially and temporally correlated load fluctuations that violate classical independence assumptions and are not captured by time-averaged spectral methods. These correlations are episodic and non-stationary, requiring analysis that resolves transient structure. This paper applies Dynamic Mode Decomposition (DMD) to the temporal evolution of pairwise inter-bus correlation coefficients to form a low-dimensional state representation that enables modal analysis without a stationarity assumption. DMD eigenvalues encode the correlation regime: their location in the complex plane distinguishes sustained coherence, decaying transients, and intensifying events, while oscillation frequency maps to underlying physical coupling mechanisms. Using an IEEE 39-bus Real-Time Digital Simulator (RTDS) testbed with three converter-interfaced AI data center loads driven by synthetic workload profiles, global DMD provides a time-averaged modal baseline in a slow thermal band ($f \approx 0.005$\,Hz, $|μ| = 0.91$) captures 93.6\% of total correlation energy. A sliding-window DMD formulation identifies transient intensification events: 51 of 775 windows (6.6\%) satisfy the $|μ_k^{(n)}| > 1$ criterion, which aligns with stochastic workload coincidences. Cross-validation with RTDS voltage coherence confirms elevated coupling during these intervals. The proposed modal growth indicator provides an early-warning signal of correlation intensification prior to peak pairwise coherence.
comment: To appear in proceedings of 8th International Conference on Smart Energy Systems and Technologies, September 2-4, 2026 | Ciudad Real, Spain
Solving Subgraph Extraction Problems Using $Δ$Search
Many NP-hard graph problems can be modeled as optimal subgraph extraction problems with feasibility constraints. From Network Design to Facility Location, from Robotics to Graph Drawing, the subgraph extraction pattern emerges across diverse domains. Despite this commonality, these problems are typically solved with domain-specific heuristics. Usually, these problems balance competing objectives such as maximizing coverage or minimizing cost while satisfying structural constraints such as connectivity, planarity and reachability. In this work, we introduce $Δ$Search, a general and fast heuristic framework that exploits the insight of Reward-Penalty optimization for solving a large class of subgraph extraction problems. The framework is easy to use as it only requires feasibility constraints and optimality criteria to be provided by the user to express the subgraph extraction problem. We also show how exact methods can be augmented with $Δ$Search to improve their performance by aggressive pruning of the search space. We evaluate our framework on monotone graph problems such as Maximum Planar Subgraph (MPS) and Minimum Connected Dominating Set, Weighted Monotone problems such as Maximum Weighted Independent Set and Minimum Weighted Steiner Tree, and non-monotone graph problems such as Prize Collecting Vertex Cover (PCVC) and Uncapacitated Facility Location Problem (UFLP). Our results show that $Δ$Search matches or surpasses state of the art heuristics for MPS, UFLP and PCVC problems with similar runtime. For the remaining problems, $Δ$Search achieves approximately 89% of the solution quality of the state-of-the-art algorithms without any problem-specific tuning
comment: 23 pages, 13 figures, 10 Tables
An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts
Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High fidelity onboard models and black box data driven approaches can recover accuracy across the flight envelope, but respectively impose computational burdens prohibitive for real time allocation and sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.
Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems
As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy - the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics, etc.) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.
comment: This peer-reviewed paper is to appear in the Journal of Intelligent and Robotic Systems
Closed-Form PI and PID Tuning of All-Pole Plants up to Third Order for Monotonic Minimum-Settling Step Responses
A unified, closed-form analytical PI/PID tuning method is presented for all-pole plants up to third order that yields a strictly monotonic (zero-overshoot) step response with minimum settling time. The design target is the binomial closed loop p^n/(s+p)^n, which is monotonic with robustness depending only on the order n. Because a fixed PI/PID cannot assign the closed-loop poles and the controller zeros independently, realizing this target exactly requires the controller zeros to be cancelled, which forces the controller numerator to divide the plant denominator. It follows that an exact, real-gained solution exists for any stable plant precisely up to second order with a PI controller and third order with a PID controller; beyond that the residual binomial factor acquires a complex pair of damping sqrt(3)/2, which a generic plant does not contain. Explicit gains are derived for first-order plants (PI), second-order plants with real and complex poles (PI and PID), and third-order plants with three real poles or one real pole plus a complex pair (PID). The freedom of the coincident designs is shown to be bounded: a quadratic nonnegativity condition gives the exact window of the design pole for strict monotonicity, which collapses at the pole-ratio-2 changeover for real poles and is nonempty for damping ratios above approximately 0.443 for complex poles. Monotonicity guarantees Mt = 1, hence Ms <= 2, phase margin >= 60 degrees, and gain margin >= 6 dB, tightening to universal constants for the binomial family. Load-disturbance attenuation obeys IAEd = 1/Ki, making the cost of cancellation explicit, and comparisons with SIMC, the CHR zero-overshoot rule, and deadbeat-fitted explicit formulas quantify the trade: at matched maximum sensitivity the proposed design settles faster than SIMC on the third-order example, with markedly lower controller gains and peak control effort.
comment: v2: extended with monotonicity windows, third-order boundary theorem in final form, and comparisons; subsumes arXiv:2604.21294
Metriplectic Conditional Flow Matching for Dissipative Dynamics
Metriplectic conditional flow matching (MCFM) learns dissipative dynamics without violating first principles. Neural surrogates often inject energy and destabilize long-horizon rollouts; MCFM instead builds the conservative-dissipative split into both the vector field and a structure preserving sampler. MCFM trains via conditional flow matching on short transitions, avoiding long rollout adjoints. In inference, a Strang-prox scheme alternates a symplectic update with a proximal metric step, ensuring discrete energy decay; an optional projection enforces strict decay when a trusted energy is available. We provide continuous and discrete time guarantees linking this parameterization and sampler to conservation, monotonic dissipation, and stable rollouts. On a controlled mechanical benchmark, MCFM yields phase portraits closer to ground truth and markedly fewer energy-increase and positive energy rate events than an equally expressive unconstrained neural flow, while matching terminal distributional fit.
Data-Driven Soft Robot Control via Adiabatic Spectral Submanifolds
The mechanical complexity of soft robots creates significant challenges for their model-based control. Specifically, linear data-driven models have struggled to control soft robots on complex, spatially extended paths that explore regions with significant nonlinear behavior. To account for these nonlinearities, we develop here a model-predictive control strategy based on the recent theory of adiabatic spectral submanifolds (aSSMs). This theory is applicable because the internal vibrations of heavily overdamped robots decay at a speed that is much faster than the desired speed of the robot along its intended path. In that case, low-dimensional attracting invariant manifolds (aSSMs) emanate from the path and carry the dominant dynamics of the robot. Aided by this recent theory, we devise an aSSM-based model-predictive control scheme purely from data. We demonstrate the effectiveness of our data-driven model in tracking dynamic trajectories across diverse tasks. We validate on high-fidelity, high-dimensional finite-element models of a soft trunk robot and Cosserat-rod-based elastic soft arms, with additional experiments confirming robust performance even in the presence of experimental noise. Notably, we find that five- or six-dimensional aSSM-reduced models outperform the tracking performance of other data-driven modeling methods by a factor up to 10 across all closed-loop control tasks.
comment: 41 pages, 24 figures, IJRR (2026) in press
Coupling Scenario-Based Grid Simulations with State Estimation: Measurement Requirements for Low-Voltage Networks under the German Energy Transition Pathway
Increasing penetration of electric vehicles, heat pumps, and rooftop photovoltaics is creating thermal and voltage stress in low-voltage distribution grids. This work links the German Federal Government energy transition pathway (2025-2045) with state estimation performance requirements, evaluated on two SimBench reference networks across three equipment quality levels (good, medium, poor) and three VDE Forum Netztechnik/Netzbetrieb (VDE FNN) measurement constellations that differ in the availability of transformer and feeder-level instrumentation. Within this work's analysis, congestion is caused exclusively by transformer overloading and voltage-band violations. No individual line exceeds its thermal rating (maximum: 89.5%). Equipment quality governs congestion onset for a given deployment trajectory: under good equipment, congestion remains absent through 2045, under medium equipment it emerges from 2035 (3/6 scenarios), under poor equipment from 2025 (6/6). Without transformer instrumentation, median voltage estimation errors reach 6-42% regardless of smart meter penetration. Adding a single transformer measurement reduces errors by an order of magnitude, achieving median errors of 0.5-1.7%. In urban networks, transformer-level instrumentation meets the VDE FNN voltage accuracy target (99th percentile voltage error below 2%) in all configurations. In rural networks under poor equipment, the target is approached but not met. These findings motivate prioritizing transformer instrumentation as an effective first step for grid observability and supplementing the current consumption-driven metering rollout with risk-based deployment criteria linked to local congestion exposure.
Nonlinear backstepping with saturation for low-thrust station-keeping of libration point orbits
This paper presents a novel nonlinear backstepping control law for continuous, low-thrust station-keeping in the Earth-Moon system. Quasi-periodic libration point orbits are targeted under a high-fidelity model of the dynamics. Almost global uniform exponential stability guarantees are attained, as shown through Lyapunov's stability theory. Saturation of the actuators is formally included in the controller design, such that these guarantees hold even in the event of saturation. The relationship between saturation threshold, control gains, and deviation is studied and an optimal procedure for gain selection is discussed. The control solution is tested numerically through a Monte Carlo analysis over representative application cases, subject to operational errors, constraints, and external perturbations. Station-keeping under actuation saturation is validated considering a conservative threshold for typical electric propulsion systems.
comment: Manuscript accepted for Acta Astronautica. Please cite the published version. For a working demo of the solution proposed, see https://github.com/antoniownunes/NL_SK_mwe
Stability properties of Minimal Gated Unit neural networks
In this work, we address the need for efficient and formally stable Recurrent Neural Networks (RNNs) in environments with limited computational resources by analyzing the stability of the Minimal Gated Unit (MGU) network, a lightweight alternative to common gated RNNs used in system identification. We derive sufficient parametric conditions for the MGU network's input-to-state stability and incremental input-to-state stability properties. These conditions enable a-posteriori validation of model stability and form the basis for novel stability-promoting training methodologies, including a warm-start of the network's parameters and a projected gradient-based optimization scheme, both of which are presented in this work. Comparative evaluation, including robustness analysis and validation on synthetic and real-world data (i.e., the Silverbox benchmark), demonstrates that the minimal gated unit network successfully combines formal stability guarantees with superior parameter efficiency and faster inference times compared to other state-of-the-art recurrent neural networks, while maintaining comparable and satisfactory accuracy. Notably, the results attained on the Silverbox benchmark illustrate that the stable MGU network effectively captures the system dynamics, whereas other stable RNNs fail to converge to a reliable model.
comment: Preprint submitted to Automatica. 16 pages, 6 figures and 1 table MATLAB code for the proposed methodologies is available at: https://github.com/StefanoDeCarli/MGU_dISS.git
DiffCoord: Differentiable Coordination for Distributed Multi-Agent Trajectory Optimization
Integrating the Alternating Direction Method of Multipliers (ADMM) with Differential Dynamic Programming (DDP) provides a scalable framework for distributed multi-agent trajectory optimization. In practice, ADMM is typically truncated for computational efficiency, tightly coupling parameters that would otherwise separately govern coordination quality and task performance. In this paper, we propose Differentiable Coordination (DiffCoord), a unified framework that jointly meta-learns these coupled parameters for the truncated ADMM-DDP pipeline. These parameters are generated by agent-wise neural networks for task adaptation, and the same networks are shared among isomorphic agents to enable scalability to varying agent counts. We achieve efficient meta-learning by differentiating the ADMM-DDP pipeline end-to-end. Notably, this yields an auxiliary ADMM-LQR distributed gradient solver that computes and coordinates meta-gradients with respect to these parameters. This solver inherits the computational structure of the pipeline, enabling reuse of key computation results and efficient parallelization over agents and along trajectory horizons. We validate DiffCoord through numerical and physical experiments on a cooperative aerial transport system, where it reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces. It adapts robustly to varying team sizes and load dynamics, while reducing per-agent gradient computation time by up to 70% compared with state-of-the-art trajectory-gradient methods.
Trojan Attacks on Neural Network Controllers for Robotic Systems
Neural network controllers are increasingly deployed in robotic systems for tasks such as trajectory tracking and pose stabilization. However, their reliance on potentially untrusted training pipelines or supply chains introduces significant security vulnerabilities. This paper investigates backdoor (Trojan) attacks against neural controllers, using a differential-drive mobile robot platform as a case study. In particular, assuming that the robot's tracking controller is implemented as a neural network, we design a lightweight, parallel Trojan network that can be embedded within the controller. This malicious module remains dormant during normal operation but, upon detecting a highly specific trigger condition defined by the robot's pose and goal parameters, compromises the primary controller's wheel velocity commands, resulting in undesired and potentially unsafe robot behaviours. We provide a proof-of-concept implementation of the proposed Trojan network, which is validated through simulation under two different attack scenarios. The results confirm the effectiveness of the proposed attack and demonstrate that neural network-based robotic control systems are subject to potentially critical security threats.
comment: Paper submitted to the 2026 IEEE Conference on Control Technology and Applications (CCTA)
Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory
Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot reconstruct the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of up to 44,000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3\% of the actuator's length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s2.
comment: Submitted to IEEE Transactions on Robotics, 20 pages, 14 figures
Towards Efficient and Secure Cloud-Assisted Autonomous Systems: A Review of Architectures, Algorithms, Security, and Deployment Challenges
Networked Control Systems (NCSs) have been instrumental in realizing fully connected and responsive intelligent environments within the context of real-time virtual control and management. However, traditional NCSs face considerable challenges in handling the vast amounts of data generated by large-scale control applications, particularly in terms of data acquisition, storage, and computational processing. To address these challenges, the emergence of cloud computing and advancements in control theory have empowered the new paradigm known as Cloud Control Systems (CCSs). Recently, CCSs have received substantial attention from industries for their potential properties, such as large-scale data management, complex computations, and data-centric optimized decisions. This study presents an extensive review of recent progress in CCSs spanning over multiple studies published between 2012 and 2025. Specifically, the focus is on providing a taxonomy of the current findings in CCS research, encompassing various perspectives, such as its efficient implementations in industrial automation, security and privacy considerations, and cloud-based control techniques. Each category is examined in depth through selected state-of-the-art analyses of different approaches and contrasting methodologies. Furthermore, we discuss future directions aimed at designing more efficient and practical CCSs. The insights gained from this study can help researchers, practitioners, and decision-makers in their domain for effective CCS design and deployment.
comment: 61 pages, 11 Figures
Disturbance Attenuation Regulator II: Stage Bound Finite Horizon Solution
This paper develops a generalized finite horizon recursive solution to the discrete time stage bound disturbance attenuation regulator (StDAR) for state feedback control. This problem addresses linear dynamical systems subject to stage bound disturbances, i.e., disturbance sequences constrained independently at each time step through stagewise squared two-norm bounds. The term generalized indicates that the results accommodate arbitrary initial states. By combining game theory and dynamic programming, this work derives a recursive solution for the optimal state feedback policy. The optimal policy is nonlinear in the state and requires solving a tractable convex optimization for the Lagrange multiplier vector at each stage; the control is then explicit. For systems with constant stage bound, the problem admits a steady-state optimization expressed as a tractable linear matrix inequality (LMI) whose empirical computational cost is approximately cubic in $n$. Numerical examples illustrate the properties of the solution. This work provides a complete feedback solution to the StDAR for arbitrary initial states. Companion papers address the signal bound disturbance attenuation regulator (SiDAR): the finite horizon solution in Part~I-A and convergence properties in Part~I-B.
Fundamentals of NOMA in Low-Earth Orbit Coordinated Multi-Satellite Networks
Coordinated multi-satellite (CoMS) transmission and non-orthogonal multiple access (NOMA) are envisioned to jointly enhance coverage, capacity, and spectrum efficiency for satellite networks. Their integration into a unified CoMS-NOMA framework will allow more efficient, reliable, and energy-efficient multi-user access. This paper investigates the downlink performance of CoMS-NOMA networks from a system-level perspective, in which multiple satellites cooperatively serve multiple users via NOMA. Leveraging tools from stochastic geometry, related angles and distances in CoMS-NOMA are first derived as intermediate results. Then, we obtain the combined signal power distributions and analyze coverage and spectrum performance under both inter- and intra-satellite interference, accounting for potential imperfect successive interference cancellation (SIC). The analytical model is validated across a range of system parameters, including the number of satellites, service region angle, error-propagation factor, and power allocation coefficients. Numerical results indicate that increasing the number of cooperative satellites does not always improve coverage and spectrum efficiency. Additionally, while a higher main-lobe gain improves coverage, a near-perfect SIC provides only slightly greater benefits than a reasonably good SIC. With properly selected power allocation coefficients, CoMS-NOMA achieves up to a 270% improvement in coverage and a 56% gain in sum spectral efficiency, compared with conventional orthogonal and single-satellite schemes, indicating potential for green, energy-efficient satellite networking.
Reformulating dq Impedance Matrices via Pauli Decomposition for Root-Cause Analysis of Instabilities in Grid-Connected Converters
The increasing penetration of converter-interfaced generators in power systems has led to the adoption of impedance-based criteria as an alternative framework for assessing and ensuring stable integration. However, when the impedance criterion is used, identifying the root cause of instabilities is generally more challenging compared to other approaches, such as modal analysis. Moreover, the eigenvalues and characteristic equation used in the impedance criterion are non-linear functions, making it difficult to establish a clear relationship between impedance components and closed-loop stability. To address this issue, this paper proposes the application of the Pauli decomposition to analyse dq impedance matrices and minor-loop equations. By using this decomposition technique, the dq representation can be reformulated into a quaternion-like form, which has explicit algebraic relationships with the determinant, trace, eigenvalues, and characteristic equation. Moreover, this decomposition enables systematic assessment of the influence of each impedance term in the system stability, thus facilitating finding the root-cause of instabilities. The primary objective of this work is to develop the mathematical foundation of the Pauli decomposition and demonstrate its implications for root-cause analysis. The theoretical contributions are validated using a case study consisting of a converter-interfaced generator connected to a weak grid that has been previously analysed in the literature using existing techniques. The proposed Pauli decomposition provides an algebraic tool that enhances interpretability of impedance-based stability analysis and establishes a basis for further investigation of complex converter interactions.
comment: The source code used to obtain the results can be found here: https://doi.org/10.5281/zenodo.17829032
Equivalent modelling for the fundamental frequency dynamic variation: State-space, impedance, and power-frequency representations
Stability of power electronic converters connected to power grids is commonly assessed by using the impedance criterion while the stability of power grids is typically analysed by using the network state-space representation. It is known that the impedance criterion may lead to erroneous results if the grid frequency dynamics are not considered while eigenvalue analysis is considered as a reliable method for system stability assessment. The equivalence between these two methods has been recently explored, without considering the effect of network frequency variations. Additionally, the link of the impedance criterion with the power-frequency dynamics of power systems also remains largely unexplored. In this paper, the equivalency between the impedance method considering the grid frequency dynamics and the conventional eigenvalue analysis is demonstrated. In addition, the dynamic interaction between the apparent power flow and the network fundamental frequency is formulated and its link with the impedance representation is shown. It is demonstrated that, by using the impedance representation with the network frequency as an additional input port, the network frequency perturbation plot (NFP) can be intuitively expressed by using quantities consistent with the impedance analysis framework. The main findings are verified using detailed numerical simulations of two representative systems.
Some Essential Constructive Foundations for Systems and Control
This work develops several constructive foundations for systems and control within Bishop-style constructive mathematics. For an engineer, the guiding principle is that an object claimed to exist, such as a trajectory, an optimal control law, a selector, or a viable solution, should come with finite data and an operation computing approximations to any prescribed precision. The style remains close to classical analysis, but existential statements are organized so that their computational content is visible. The paper begins with elementary geometric data in finite-dimensional Euclidean spaces: blocks, multiblocks, representable sets, regular functions, and certified integrals. This set-first integration route is meant to complement, rather than replace, abstract constructive integration theories such as Daniell-type or integration-space approaches. The developed apparatus is then applied to a constructive functional extremum-value theorem, selector extraction for multifunctions, Filippov-type and viable solutions of differential inclusions, regular probability densities, controlled Markov chains, and empirical density certificates. A short account of resolvent projectors and linear stability is included for completeness.
comment: 130 pages, 15 figures
LoRa and LoRaWAN simulator-cum-emulator with CAD and capture effect in Python
Existing LoRaWAN/LoRa simulators consist of large, complicated C++ codebases and often do not support all device classes. This paper presents the design of a simple to use, Python-based discrete-event simulator that addresses these gaps while also introducing a novel method for evaluating real device firmware in the simulator. The simulator is built on a custom asyncio-based simulation kernel, a three-phase packet delivery model that reproduces the capture effect, a full LoRaWAN 1.0.4 stack, and a containerized firmware system that cross-compiles real STM32 C firmware and redirects HAL calls into the simulator via CFFI. The simulator is distributed as a Python package via Github (https://github.com/MatthijsReyers/lora-simulator) and requires no external simulation framework or dependencies.
comment: Totally 11 Pages; Github link ncluded
Traceable Virtual Sea Trials in the Marine Robotics Unity Simulator for Manoeuvring Assessment of Unmanned Surface Vehicles
Accurate identification of hydrodynamic derivatives is essential for precise control and autonomous navigation of Unmanned Surface Vehicles (USVs). However, acquiring high-fidelity manoeuvring data from physical sea trials is often constrained by cost, safety, and environmental disturbances. Standard manoeuvring trials, particularly Turning Circle (TC) and Zig-Zag (ZZ), remain fundamental to IMO and ITTC assessment procedures because they provide comparable performance metrics reflective of underlying hydrodynamic behaviour. This paper extends the open-source Marine Robotics Unity Simulator (MARUS) by introducing a standardised Virtual Sea Trial framework for automated execution and data generation of TC/ZZ manoeuvres. The framework provides traceable command-actuation logging, system-identification (SI)-focused data conditioning, and automated extraction of IMO/ITTC-aligned manoeuvring metrics. A key contribution is a dedicated TC/ZZ data acquisition and post-processing pipeline, improving the repeatability and auditability of simulator-based manoeuvres while producing SI-ready datasets for hydrodynamic-derivative identification and digital-twin workflows. The framework also provides explicit command-execution separation for differential-thrust steering, where manoeuvre inputs are recorded as ordered rudder-equivalent commands and realised actuation is logged as an execution-level proxy derived from applied thrust. Case study results demonstrate repeatable and IMO-compliant manoeuvre behaviour. For TC tests, the normalised advance differs by approximately 3.9% between port and starboard turns, while the tactical diameter differs by 4.6-4.7%. For ZZ tests, first and second overshoot excesses remain below 1 degree for both +/-10-degree and +/-20-degree manoeuvres, satisfying IMO criteria, while peak yaw rates range from approximately 4.1 to 5.8 degrees/second.
Robotics
FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning
Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at https://jasonjzliu.com/factr2
comment: Website at https://jasonjzliu.com/factr2
World Pilot: Steering Vision-Language-Action Models with World-Action Priors
Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/
comment: Project Website: https://world-pilot.github.io/
DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success--cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at jadee-dao.github.io/direct/.
VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving
Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments conducted on challenging nuScenes and Bench2Drive datasets for open-loop and closed-loop evaluations, respectively, show the superiority of VLGA over counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 (0.50\,m average) and 3-second collision rate (0.18\%). On closed-loop Bench2Drive, VLGA attains the state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
comment: Project page: https://yaojin17.github.io/VLGA/
Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration
Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver's activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.
UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning
Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human corrections to redirect the policy out of unproductive exploration, which incurs high labor cost and limits real-world scalability. To address this, we propose UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states, taking over the bulk of interventions from human operators. Specifically, UniIntervene first performs future-conditioned action-value estimation, predicting the latent consequence of the current action and evaluating its induced value, which provides a more stable progress signal. Building on this, a temporal value-risk critic aggregates recent value dynamics and triggers intervention when the estimated value exhibits sustained stagnation or degradation. When intervention is required, UniIntervene retrieves a high-value recovery target from a memory of past intervention episodes and produces executable corrective actions through a goal-conditioned recovery policy. In this way, UniIntervene turns intervention from passive human correction into a value-aware recovery process for efficient real-world RL. Extensive experiments on diverse real-world manipulation tasks demonstrate that UniIntervene improves the average success rate by 8.6% while reducing human interventions by 57% relative to state-of-the-art HiL-RL baselines.
comment: Project page: https://denghaoyuan123.github.io/UniIntervene-project/
APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies
Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including the $π$ and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/
Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics
We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources in robotics often fail to separate the meaningful and the harmful features in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage. Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This induces two important properties on the optimal Diffusion Policy that we exploit: a global-to-local hierarchy and locality. We theoretically formalize this discussion using a simplified model. Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment - a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.
comment: 14 pages (main body), 52 pages total. Project website: https://ambient-diffusion-policy.github.io/
CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy
Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference time to overcome partial observability. Our key insight is that the visuomotor priors of pretrained vision-language-action (VLA) models should enable reactive, decentralized collaboration from each robot's local observations alone, without these inference-time assumptions. We propose CHORUS, a framework that adapts a single VLA backbone to control diverse, multi-robot teams. At inference time, each robot runs an independent copy of CHORUS, conditioned only on its own observations and a robot-identifying prompt. In real-world experiments including mobile tape measurement, library book handovers, and laundry basket lifting, CHORUS achieves a 64% point improvement over decentralized, from-scratch models, improves reactivity to teammate behavior by 40% points, and outperforms centralized baselines. Together, these results show that a shared VLA backbone is capable of achieving decentralized multi-robot collaboration, without per-robot policies or inter-robot communication at inference.
comment: Project Website: https://chorus-model.github.io
Traceable Virtual Sea Trials in the Marine Robotics Unity Simulator for Manoeuvring Assessment of Unmanned Surface Vehicles
Accurate identification of hydrodynamic derivatives is essential for control and navigation of Unmanned Surface Vehicles (USVs), but high-fidelity manoeuvring data from physical sea trials are constrained by cost and safety. Turning Circle (TC) and Zig-Zag (ZZ) trials remain fundamental to IMO and ITTC assessment procedures. This paper extends the Marine Robotics Unity Simulator (MARUS) by introducing a standardised Virtual Sea Trial framework for automated execution and data generation of TC/ZZ manoeuvres, with traceable command-actuation logging, system-identification (SI)-focused data conditioning, and automated extraction of IMO/ITTC-aligned manoeuvring metrics. A key contribution is a dedicated TC/ZZ data acquisition and post-processing pipeline, improving the repeatability and auditability of simulator-based manoeuvres while producing SI-ready datasets for hydrodynamic-derivative identification and digital-twin workflows. Another feature is explicit command-execution separation for differential-thrust steering, where inputs are recorded as ordered rudder-equivalent commands and realised actuation is logged as an execution-level proxy derived from applied thrust. Case-study results demonstrate repeatable and compliant manoeuvre behaviour. For TC tests, the normalised advance differs by approximately 3.9 percent between port and starboard sides, while the tactical diameter differs by approximately 4.6 to 4.7 percent. For ZZ tests, first and second overshoot excesses remain below 1 degree for both +/- 10 degree and +/- 20 degree manoeuvres, satisfying IMO criteria, while peak yaw rates range from approximately 4.1 to 5.8 deg/s. Overall, the framework provides a repeatable and auditable virtual sea-trial workflow for generating IMO/ITTC-aligned datasets and supporting system identification, hydrodynamic-derivative estimation, and digital-twin calibration.
Fast-SDE: Efficient Single-Microphone Sound Source Distance Estimation in Reverberant Environments
Sound source distance estimation (SDE) is a critical capability in human-robot interaction. An inappropriate interaction distance not only reduces the reliability of speech acquisition and understanding, but also compromises the naturalness and comfort of the interaction. Most existing SDE methods rely on microphone arrays, however, multi-microphone systems typically require careful hardware synchronization, geometric calibration, and additional space and computational resources, which limits applicability to size-constrained and computability-limited embodied platforms. To alleviate these issues, we propose Fast-SDE, a lightweight single-microphone SDE framework that is suited for deployment on robot platforms with limited computational resources and strict size constraints. Specifically, Fast-SDE employs a subband-based backbone that decomposes the frequency axis into multiple subbands, rather than processing the entire spectrum with a wide full-band backbone. A shared subband encoder then maps each subband to a compact latent representation and learns the relationship between acoustic structure and time-frequency patterns. Finally, a lightweight regression head converts the fused subband representations into the estimated distance. Extensive simulation and real-world experiments demonstrate the merits of the proposed method. To benefit the broader research community, we have open-sourced our code at https://github.com/JiangWAV/FAST-SDE.
comment: To appear in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
Fourier Features Let Agents Learn High Precision Policies with Imitation Learning ICML 2026
High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directly, such as those based on point clouds, offer a stronger geometric prior over purely image-based ones, yet their performance remains highly task-dependent. We hypothesize that this discrepancy may be due to the spectral bias of neural networks towards learning low frequency functions, which especially affects architectures conditioned on slow-moving Cartesian features. We thus propose to map point clouds from Cartesian space into high-dimensional Fourier space, effectively equipping the point cloud encoder with direct access to high-frequency features. We experimentally validate the use of Fourier features on challenging manipulation tasks from the RoboCasa and ManiSkill3 benchmarks and on a real robot setup. Despite their simplicity, we find that Fourier features provide significant benefits across diverse encoder architectures and benchmarks and are robust across hyperparameters. Our results indicate that Fourier features let policies leverage geometric details more effectively than Cartesian features, showing their potential as a general-purpose tool for point cloud-based imitation learning. We provide source code and videos on our project page: https://fourier-il.github.io/fourier-il
comment: Published as a conference paper at ICML 2026
UGV-Conditioned Multi-UAV Informative Planning on a Shared Exposure Belief
Safe ground navigation in large, threat-augmented environments requires aerial support that actively reduces the risks that a ground vehicle faces along its route. Existing aerial reconnaissance systems focus on mapping or covering the environment, but do not direct sensing toward regions that are most relevant for ground vehicle safety. In this paper, we address the problem of coordinating a team of unmanned aerial vehicles (UAVs) to improve the safety of an unmanned ground vehicle (UGV) navigating through unknown threat zones. A key aspect of our approach is a shared exposure belief that is updated online from aerial observations and used jointly by the UAV team and the ground vehicle. This enables us to direct aerial sensing towards route-relevant regions while allowing the UGV to replan around newly revealed threats. We coordinate the UAV team through spatial region assignment to avoid redundant sensing. Simulation experiments show that our approach reduces cumulative UGV exposure by 38% compared to a system that does not account for hazard levels, and reduces redundant aerial coverage from 38.8% to 3.7% under our multi-UAV coordination scheme.
comment: 8 pages, 6 figures
Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering
Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero-shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed-loop VLA task performance, distills these sequences into a test-time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out-of-distribution scenarios. Crucially, our approach operates on arbitrary frozen pre-trained VLAs, requiring neither access to the original training distribution nor fine-tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open-loop prompting.
comment: 22 pages, 14 tables, 14 figures
DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems
Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed--accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.
Making Foresight Actionable: Repurposing Representation Alignment in World Action Models
World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.
Intelligent Automation for Embodied Benchmark Construction: Pipelines, Embodiments, Simulators, and Trends
Embodied intelligence now spans navigation, household assistance, manipulation, autonomous driving, aerial agents, and multimodal large-model control. This expansion has made benchmark construction a central bottleneck for reliable evaluation. Unlike static datasets, embodied benchmarks combine task specifications, environments, robot data, demonstrations, annotations, metrics, evaluation scripts, and release policies into a single evaluation system. This survey reviews the literature through a five-stage construction pipeline: requirement and task construction, data acquisition, data cleaning and annotation, benchmark suite generation and metric definition, and evaluation execution with diagnostic feedback. For each stage, the survey analyzes the transition from manual curation to traditional automation, foundation-model assistance, and agentic closed-loop workflows. It also compares qualitative construction costs across human labor, data and asset acquisition, compute and simulation, validation and debugging, governance and maintenance, and rework risk. The main conclusion is that automation does not simply reduce benchmark cost. Instead, it often shifts cost toward validation, auditability, version control, and long-term governance. Progress in embodied evaluation will therefore depend not only on larger benchmark suites, but also on construction pipelines that are diagnosable, auditable, and responsibly refreshable.
AerialClaw: An Open-Source Framework for LLM-Driven Autonomous Aerial Agents
Unmanned aerial vehicles (UAVs) are increasingly used in inspection, search and rescue, environmental monitoring, and emergency response. However, most UAV applications still rely on pre-defined command sequences or task-specific pipelines, where developers manually connect perception, planning, flight control, simulation, logging, and safety modules. This limits the flexibility, reproducibility, and extensibility of autonomous aerial systems. This paper presents AerialClaw, an open-source software framework that enables UAVs to operate as decision-making aerial agents rather than merely command-following platforms. Given a natural-language mission, AerialClaw allows an LLM-based agent to understand the task, maintain context, invoke executable aerial skills, observe perception and runtime feedback, and iteratively update its decisions in a closed loop. The framework adopts a modular brain-skill-runtime architecture, combining hard skills for atomic UAV operations, Markdown-based soft skills for reusable task strategies, document-driven agent state and capability boundaries, memory-driven reflection, safety-oriented runtime validation, and platform-agnostic execution adapters. AerialClaw supports lightweight mock execution, PX4 SITL with Gazebo, and AirSim-based simulation, together with a web console, pluggable model backends, example missions, simulation assets, and staged deployment scripts. By combining standardized aerial skills, document-driven agent state, memory, and closed-loop LLM decision-making, AerialClaw provides a reproducible and extensible open-source framework for building UAV systems that can interpret missions, make decisions, execute skills, and adapt their behavior from feedback.
PEBRE: An Open-Hardware Compute and Perception Add-On for the Pepper Robot
This paper presents the design, development, and experimental verification of PEBRE, an open-hardware add-on for fast software development on the Pepper Robot. Our project enhances Pepper's computational and perception capabilities by integrating external components such as a Jetson Orin Nano, Logitech BRIO, Intel RealSense D435i, Samson UB1, and RØDE VideoMicro II. Our results show that the new hardware considerably improved Pepper's perception abilities and computational power. This development contributes to the community by implementing an open hardware and open-source modular add-on to the Pepper robot and keeping this relevant research platform functional beyond its expected lifespan. With PEBRE, we aim to facilitate faster software development and more efficient integration of external components, ultimately enhancing the capabilities of the Pepper robot.
Bridging the Morphology Gap: Adapting VLA Models to Dexterous Manipulation via Intent-Conditioned Fine-Tuning
Vision-Language-Action (VLA) models have demonstrated remarkable zero-shot generalization in robotic manipulation, yet the vast majority of pre-trained pipelines remain strictly confined to low-DoF parallel grippers. Adapting these rich semantic priors to high-DoF dexterous hands introduces a severe morphology gap, direct end-to-end joint fine-tuning inherently causes catastrophic forgetting of spatial reasoning and acute action manifold collapse due to data scarcity. In this paper, we present InDex, a novel, data-efficient adaptation framework rooted in cross-morphology semantic inheritance. Rather than discarding the pre-trained 1-DoF parallel grasp output, we repurpose it as a continuous, macroscopic virtual grasp intent proxy to sequentialize the control topology. We implement a two-stage decoupled learning architecture: the first stage parameter-efficiently aligns the VLA backbone to predict continuous arm trajectories and the scalar grasp intent; the second stage freezes this spatial backbone and leverages an intent-conditioned denoising diffusion head to decode fine-grained joint articulations for multi-fingered end-effectors. Extensive simulation benchmarks across a suite of multi-stage, contact-rich dexterous manipulation tasks demonstrate that InDex effectively masters intricate skills with minimal demonstration data, substantially outperforming monolithic baselines while preserving the robust spatial generalizability of the original VLA prior.
DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model
Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2\% vs.\ 40.95\%) while sustaining smooth, reactive 100\,Hz control. Project website: \href{https://intuitive-robots.github.io/DAM-VLA/}{intuitive-robots.github.io/DAM-VLA/}
comment: 17 pages, 8 figures
Fibration Trees: A Unified Approach to Multi-Robot Motion Planning
State space projections and decompositions have emerged as powerful tools to tackle the curse of dimensionality in high-dimensional, multi-robot motion planning problems. However, existing methods lack a unified framework which seamlessly handles combinations of projections (prioritization or task-space) and decompositions (parallel or decoupled subspaces). To fill this gap, we introduce fibration trees, which are trees consisting of state spaces as nodes and fibrations as edges, whereby a fibration models a projection from a higher-dimensional space to a lower-dimensional (or simplified) space. By modeling projections as fibrations, we unify sequential prioritization, parallel decomposition, and task-space projections under a single, coherent formalism. Building on this, we develop the rapidly-exploring random fibration trees (Fibration-RRT) planner, a sampling-based motion planner that generalizes strategies from quotient-space RRT (for sequential prioritizations) and discrete RRT (for parallel decompositions), while allowing the inclusion of task-space projections. Fibration-RRT operates on user-defined fibration trees and is proven to be probabilistically complete. To test the generality and efficiency of Fibration-RRT, we provide an open-source implementation and conduct experiments on 32 scenarios using multi robot teams with up to 96 degrees of freedom. Our results indicate that Fibration-RRT efficiently solves high-dimensional problems by exploiting user-defined fibration trees, thereby establishing fibration trees as a powerful, unified framework for multi-robot motion planning.
comment: 23 pages, 12 figures
Point Cloud Segmentation for Autonomous Clip Positioning in Laparoscopic Cholecystectomy on a Phantom
High-risk applications in robotics, such as robot-assisted surgery, present unique challenges. These systems must be both highly precise and interpretable in order to be deployed in environments with very low tolerance for error or unsafe exploration. We present the first robotic system to demonstrate autonomous clip positioning on a physical phantom in laparoscopic surgery, one of the most common interventions in general surgery. After segmentation of a colorless point cloud from a single camera, target positions for the clips are extracted using spline interpolation, and can then be adjusted by the human operator. The segmentation model is trained on only 60 hand-labeled real point clouds, reflecting data scarcity in the surgical domain. We overcome this with a combination of pre-training on 128,000 synthetic point clouds and two novel data augmentation techniques. The motion of the end-effector to each target is visualized for the operator, satisfying the unique motion constraints of minimally-invasive surgery while ensuring that the robot's actions are verifiable and interpretable. In real robot experiments, our system localizes targets with the required precision of 0.75mm at a 95% success rate and executes autonomous clip positioning with a 100% success rate. We provide insights that are applicable to many other surgical and non-surgical tasks that require identifying and navigating to a precise target. Source code and project page: https://github.com/balazsgyenes/kirurc
comment: 8 pages, 5 figures, accepted to IEEE Robotics and Automation Letters (RAL)
KinematicRL: A Sim-to-Real Reinforcement Learning Framework For Social Navigation With Kinodynamic Feasibility
Deep Reinforcement Learning (DRL) has shown promise for social navigation, yet its real-world deployment remains hindered by a persistent sim-to-real gap arising from simplified first-order dynamics and context-specific human state estimation pipelines. This work presents a unified framework that addresses these limitations to produce dynamically feasible navigation policies suitable for real-world deployment. First, theoretical analysis reveals that tracking error between simulated and actual robot position decays exponentially with increased control order, motivating the use of higher-order control inputs as DRL action space. A second-order control formulation tailored to differential drive robots is developed, complemented by a stochastic iterative Linear Quadratic Regulator (iLQR) that pretrains the policy via a divergence minimization objective. Second, to avoid the added system complexity of camera-LiDAR fusion, a cluster-based human tracking pipeline using only 2D LiDAR is introduced. Human detections are associated according to both spatial proximity and velocity similarity, enabling reliable differentiation of nearby pedestrians and yielding stable velocity estimates through temporal aggregation. Third, we introduce an unbiased residual gating block to balance reaction- and memory-based behaviors while handling time-varying crowd sizes, both critical for social navigation. The resulting policy, KinematicRL, consistently improves kinematic performance and adapts to varying number of detected humans. Experiments in real-world environments demonstrate that, when combined with the proposed tracking pipeline, KinematicRL can be deployed on a real differential drive robot with minimal modifications.
comment: Accepted by IEEE Transactions on Automation Science and Engineering (T-ASE)
VICX: Generalizable Robot Manipulation via Video Generation and In-Context Operator Network
Generalizable robot manipulation requires not only task-level reasoning over unseen scenes, but also reliable grounding of visual plans into embodiment-specific execution. To bridge this gap, we propose VICX (Video generation and In-Context eXecution), a decoupled closed-loop manipulation framework. In VICX, a frozen video generation model produces vision-language-conditioned high-level visual plans, while a Video-to-Trajectory In-Context Operator Network (V2T-ICON) serves as the task-agnostic interface that grounds these plans into executable robot-state trajectories. To improve execution generalization, V2T-ICON operates on segmentation-extracted arm-only frame observations and uses retrieved image-state pairs as in-context prompts, allowing a robust and generalizable visual-to-state mapping at inference time without parameter updates. Experiments on Meta-World show that VICX supports cross-task generalization, closed-loop self-correction, and cross-embodiment transfer, demonstrating dual generalization across both task semantics and robot execution. The project webpage can be found here: https://scaling-group.github.io/vicx/.
comment: The first two authors contributed equally to this work
Learning Unions of Convex Sets via Invertible Latent Decomposition for Path Planning
Collision-free path planning in cluttered, real-world environments relies on a representation of the collision-free space, and existing representations broadly fall into two categories. Explicit representations, such as unions of convex sets, can be plugged into optimization-based planners as hard collision-free constraints, but their parameters scale poorly with configuration-space dimension. Implicit representations, by contrast, are flexible and scale well to complex geometries, yet typically lack such guarantees. We bridge this gap with ILD (Invertible Latent Decomposition), a framework that jointly learns an invertible mapping and a union of explicit convex polytopes in the resulting latent space. Planning is carried out over these latent convex sets, and the invertible mapping decodes the resulting paths back to the original configuration space while preserving feasibility with respect to the refined explicit safe regions. We further propose Visibility-Guided Sampling (VGS) to keep the convex sets connected for path planning. Across 2D navigation, 6-DoF, and 14-DoF manipulation environments, ILD achieves broader coverage, better inter-set connectivity, and higher path-planning success rates than prior baselines, with zero observed false positives after test-time refinement. On a 14-DoF bimanual manipulator, we further demonstrate real-time collision-free planning, with test-time refinement adapting to scene-geometry changes during real-world deployment on a single 6-DoF arm.
MPPI-based Informative Trajectory Planning for Search and Capture of Drifting Targets with ASVs
Autonomous surface vehicles offer an efficient solution for environmental cleanup as well as search and rescue operations in open waters. Targets in these settings drift continuously, so efficient search must balance exploration of unobserved regions with tracking of known targets. However, most target tracking and pursuit scenarios consider simple guidance behaviours and short-term predictions for decision-making. In this letter, we address the problem of search and capture of multiple drifting targets, such as litter, in dynamic environments, using a hybrid planning framework. A key aspect of our strategy is a spatiotemporal informative planning method based on model predictive path integral (MPPI) control, a sampling-based model predictive control approach. The planner directly generates kinematic-level commands by optimising continuous trajectories over long horizons. A multi-objective cost balances search and tracking objectives while ensuring safe, feasible trajectories. In the interception stage, we switch to a pure pursuit guidance controller for the physical capture of moving targets. Experiments show that our planner outperforms the chosen planning baselines. Finally, we validate our approach in field trials with an ASV.
Deformable In-Hand Slip-Aware Tactile Sensor with Integrated Velocity, Force/Torque, and Pressure Map Sensing
This paper introduces a novel tactile sensor for in-hand manipulation with slip-aware control that integrates velocity, force/torque, and pressure map sensing into a single device with a deformable contact pad. To the best of our knowledge, this is the first sensor to combine these sensing modalities within a single compliant structure. The sensor features a deformable contact surface and can robustly track both flat and curved surfaces across a wide range of materials. Its performance is evaluated through a comprehensive set of experiments that highlight both its capabilities and limitations. The sensor is designed for rapid and low-cost fabrication using a combination of standard PCB manufacturing and rapid prototyping techniques.
DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World
Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity and failure modes that are not well captured by existing benchmarks. We introduce DuoBench, an extensible benchmarking framework for bimanual manipulation policies on the FR3 Duo platform. DuoBench comprises eleven tasks spanning four coordination categories, implemented in simulation and partially reproduced in the real world through reproducible task recipes with 3D-printable assets. In addition, we propose a stage-based evaluation scheme that supports fine-grained semantic failure analysis beyond binary success and provide human-teleoperated datasets for all benchmark tasks. We benchmark several dual-arm imitation-learning and vision-language-action policies in simulation and on real hardware. Our results show that current policies remain challenged by bimanual manipulation, particularly in early interaction stages, parallel arm execution, and transfer between simulation and real-world settings. DuoBench provides a reproducible testbed for diagnosing these failure modes and studying future methods for dual-arm policy learning. Code, datasets, and videos are available at https://duobench.github.io/
Critic Architecture Matters: Dual vs. Unified Critics for Humanoid Loco-Manipulation ICRA 2026
Multi-objective reinforcement learning for humanoid robots must coordinate locomotion and manipulation within a single policy. A natural design choice is whether to use a single (unified) critic that estimates the combined value of all objectives, or separate (dual) critics with disjoint reward signals. We present a controlled comparison on the Unitree G1 humanoid (23 active DoF) in NVIDIA Isaac Lab, training loco-manipulation policies through a sequential curriculum spanning 13 levels from stationary reaching to walking with variable-orientation targets. In standardized evaluation, dual-critic policies reach targets 3.5$\times$ faster (6.5 vs. 22.6 simulation steps), achieve 2$\times$ higher throughput (14.3 vs. 7.0 validated reaches per 1,000 steps), and attain higher validated reach rates (65.2% vs. 53.8%) compared to the unified-critic policy. Notably, additional anti-gaming reward mechanisms provide no further improvement beyond the architectural change alone (60.9% vs. 65.2%). These results have direct implications for the emerging paradigm of RL fine-tuning of imitation-learned policies: when refining a pre-trained manipulation policy with RL, a unified critic risks suppressing the learned behavior through competing locomotion gradients. These findings demonstrate that critic architecture is a primary - and often overlooked - design choice in multi-objective humanoid RL, with greater impact than reward engineering on reaching efficiency.
comment: Accepted at the ICRA 2026 Workshop on Reinforcement Learning for Imitation Learning (RL4IL), Vienna, Austria. 4 pages, 2 figures
Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection ICML 2026
Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on task-agnostic embedding stability alone. We study whether corruption-induced embedding drift predicts changes in a task-aligned hazard score derived from CLIP image-text similarities. Using controlled corruptions on BDD100K road scenes, we compare embedding drift against margin drift, defined as the change in hazard score under perturbation. The relationship is highly corruption-dependent: some families exhibit strong coupling between representation drift and decision drift, while others induce hazardous decision instability despite relatively modest embedding change. Furthermore, corruption families differ in failure direction: most suppress hazard detections via false negatives, while occlusion instead triggers false alarms, suggesting that benchmark design should account for asymmetric failure modes, not just overall instability rates. These results suggest that robustness benchmarks should include task-aligned stability measures in addition to embedding-level perturbation statistics.
comment: 8 pages (5 main body + 3 references / appendices). ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)
Modular Anthropomorphic Hand Design via Multi-Parameter Finger Benchmarking and Selection
Designing anthropomorphic dexterous robotic hands remains challenging as the design space straddles morphology, actuation, and sensing properties, and performance metrics span both task-dependent and task-agnostic. Existing optimization methods are often unstructured or consider only a single performance metric, limiting systematic comparison and targeted refinement. While the design considerations of the entire hand are significant, the individual finger properties play a key role in dexterity. By developing a robotic hand platform where fingers can be modularly integrated into a full teleoperated hand, we propose that optimizing the fingers can significantly improve overall hand performance. This approach enables rapid screening of different finger-level prototypes through a number of quantitative benchmarks before their integration into the hand for task-level validation. Candidate finger designs (incorporating variations in joint, bone, skin, and sensor placement) are assessed using both mechanism-oriented and task-relevant metrics, which establish a quantitative link between component design and full hand embodiment. The framework is validated through the development of an anthropomorphic robotic hand with optimized fingers, demonstrating how these fingers enable performance improvements across tasks, including multi-object grasping and light bulb screwing.
comment: 14 pages, 13 figures. Submitted to an IEEE journal for possible publication
Human-Guided Co-Manipulation of Carbon Fiber Plies
The handling of flexible materials is a difficult task to fully automate due to the challenges caused by the deformability of these types of objects. Meanwhile, a fully manual process can be ergonomically challenging, tedious and inefficient. Thus, human-robot collaboration (HRC) and cooperative manipulation (co-manipulation) have received increasing interest in this field as they enable human involvement when needed while also improving productivity. To enable efficient co-manipulation and interaction between the human operator and the robot, different modalities and control methods are required. In this paper, we present and examine different control methods for co-manipulation of carbon fiber plies, evaluating the pros and cons of each method in a controlled setting. We propose that a multimodal combination of speech commands, wrist-tracking through vision, and force with compliant control would provide the best solution for complete and intuitive control of the task.
comment: Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)
Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning
Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27\% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page:Dex-Blind-Grasp.github.io.
comment: 23 pages, 6 figures
TacCoRL: Integrating Tactile Feedback into VLA via Simulation
Vision-language-action (VLA) models provide strong visual, language, and action priors for robot manipulation, but visual observations alone often miss the local contact state required for contact-rich tasks. We present TacCoRL, a scalable framework that injects Tactile feedback into VLA policies and improves them through sim-real Co-training and simulation-based reinforcement learning (RL), without requiring large-scale tactile pretraining or extensive real-world contact exploration. The key idea is not only adding touch as an input, but learning how contact readings should modulate action responses in near-failure states that are rare in demonstrations and risky to collect on hardware. We use a real-aligned simulator as a closed-loop training environment for contact interaction. Mixed simulated and real trajectories first warm-start tactile-conditioned actions in the pretrained policy. Reinforcement learning with verifiable task rewards then optimizes the policy using simulated contact rollouts. It reinforces tactile-conditioned actions that lead to task completion, while a supervised objective on real trajectories keeps the refined policy anchored to deployment visual, tactile, and action distributions. The resulting policy transfers directly to the real robot without privileged simulation state or online real-world RL. Across four bimanual contact-rich tasks, the final visuo-tactile policy achieves an average success rate of 72.5%, compared to baseline of 50.0%. Result videos and more details are available at https://tac-corl.github.io/
Explore From Sketch: Accelerating UAV Exploration in Large-scale Environments with Prior Maps
Autonomous exploration with UAVs in large-scale, topologically complex environments often suffers from low efficiency due to suboptimal scheduling and detours. Prior maps (e.g., construction drawings), although usually imprecise and flawed, are readily available in many scenarios and have the potential to provide global structural guidance. This paper presents a novel exploration framework that leverages sparse, unaligned, and even discrepant 2D prior maps for LiDAR-based UAV exploration. First, a robust 2D-3D point cloud registration pipeline is proposed to align LiDAR observations with prior maps. The registration pipeline combines a GeoContext descriptor for single-frame candidate retrieval, a multi-frame verification mechanism for coarse transformation estimation with outlier rejection, and a Scale-ICP algorithm for refinement. The registration module can handle map discrepancies and provide multiple hypotheses when geometric ambiguities arise. To effectively utilize the registration results for exploration planning, we further develop a hierarchical viewpoint planning strategy under localization uncertainties. The hierarchical strategy first spatially attaches local viewpoints to prior guidepoints and adopts a Monte Carlo Tree Search solver to determine their traversal sequence under each registration hypothesis. To mitigate registration uncertainty, a risk-aware selector evaluates prior sequences using confidence-weighted travel risk, and a fixed-endpoint traveling salesman problem is formulated to generate an efficient local coverage path under the selected prior guidance. Benchmark evaluations reveal up to 34.2% improvement in exploration efficiency and 37.9% reduction in flight distance compared to state-of-the-art methods, while extensive simulations and field experiments further demonstrate robustness to prior map incompleteness and deformations.
comment: 25 pages, 22 figures
Improving Human Diving Endurance with a Field-Deployable, Untethered Exoskeleton
Human endurance in underwater locomotion is fundamentally restricted by high energetic demands to overcome drag and the finite supply of self-contained breathing gas. While exoskeleton technology can reduce the metabolic cost of humans in terrestrial locomotion, its potential to enhance human endurance during underwater diving remains entirely unexplored. Here, we present DiveMate, a field-deployable, untethered exoskeleton designed to improve human diving endurance via adaptive kick assistance in real-world underwater environments. During naturalistic diving, DiveMate increases the travel distance using a given energy (breathing gas) by 42.9% and extends dive duration by 54.9% through reducing gas consumption rate. Marked reductions in muscle activation indicate a decrease in physiological exertion, with the net gas consumption rate decreasing by 47.0%. Kinematic characteristics and regularity improvements further underpin efficient energy economy. These results suggest that applying exoskeleton assistance is beneficial for improving human diving endurance and augmenting their ability to explore the aquatic world. This study extends the application frontier of exoskeletons and provides a potential reference for the design and assessment of future underwater assistive devices.
DroneShield-AI: A Multi-Modal Sensor Fusion Framework for Real-Time Autonomous Drone Threat Detection, Behavioral Intent Classification, and Swarm Intelligence in Contested Airspace
Unmanned Aerial Vehicle (UAV) threats have emerged as a defining security challenge of the 21st century. This paper presents DroneShield-AI, a unified open framework integrating six processing layers: RF signal classification, acoustic motor-signature detection, YOLOv8-based visual detection, evidence-weighted sensor fusion, a Behavioral Intent Classification Engine (BICE), and a Graph Neural Network Swarm Intelligence Module (GNN-SIM). BICE introduces the first systematic six-class threat taxonomy for drone flight patterns, enabling predictive operator alerts with a 30-second advance-warning horizon. GNN-SIM is the first open framework for adversarial multi-drone formation analysis using Graph Attention Networks. Evaluated on three publicly available real-world datasets, the fused pipeline achieves 96.1% detection accuracy, 3.2% false alarm rate, AUC-ROC: 0.981, and 142ms end-to-end latency on commodity CPU-class hardware at approximately $500-$780 USD total system cost. All code, model weights, and simulation datasets are publicly released at submission.
comment: 23 pages, 6 figures, 11 tables. Code available at https://github.com/bayizeremarius/DroneShield-AI
SAFER-Nav: Enhancing Safety for Visual Robot Navigation via Segmentation-Aware Fine-Tuning
Vision-based navigation models, particularly foundation models, generate viable trajectories from RGB observations alone. However, even state-of-the-art transformer- and diffusion-based policies struggle to generalize in unfamiliar deployment environments containing unseen obstacles or shifted conditions. The resulting trajectories often remain goal-directed but unsafe. Existing efforts improve safety through external trajectory correction or internal geometric priors, yet the resulting policies are not trained to explicitly represent obstacle boundaries or traversable free-space structure. To address this, we propose a navigation model that incorporates these structures directly into the policy via fine-tuning and is designed to be compatible with diverse RGB-based backbones. Across multiple robot platforms, indoor environments, and static and dynamic obstacle scenarios, our method reduces collision frequency relative to ViNT, NoMaD, and their CARE-augmented variants while maintaining goal-reaching performance.
LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition
The most widely-adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive to collect and tied to specific embodiments. In contrast, unstructured human videos provide a scalable alternative. They contain diverse manipulation demonstrations across objects, scenes, and strategies, but are not directly connected to robot action. We propose LUCID, a two-stage framework that learns task intent from unstructured human videos drawn from internet-scale datasets and learns robot control in massively-parallel simulation. The intent model predicts short-horizon intent (what should happen next in the scene) from the current observation in closed loop. An embodiment-specific sensorimotor policy converts this intent into robot actions. The intent interface is shared across controllers, so the same intent model can be applied to different embodiments, from our primary dexterous hand to a parallel-jaw gripper. We evaluate LUCID on five real-world manipulation tasks: stirring, wiping, and binning supervised by only internet video, with zero-shot transfer to novel scenes and object instances; and push-T and cable routing supervised by 1 hr each of self-collected smartphone video. Project page: https://lucid-robot.github.io/.
Distortion-Resilient Robotic Imitation Learning for Autonomous Cable Routing
The rapid development of intelligent control methodologies has endowed robots with powerful autonomous intelligence. Cable routing, a ubiquitous foundational task in industry, provides a rigorous benchmark for robotic dexterity and sequential decision-making. In these practical scenarios, image observation distortion frequently occurs. Samples characterized by low-quality image observations often hinder accurate model training, posing challenges to the reliability and accuracy of intelligent control systems. Nevertheless, no dedicated intelligent control solution has been proposed for scenarios of image signal distortion. Meanwhile, image quality information has not been sufficiently exploited to further enhance the performance of intelligent control methodologies. To this end, we propose a novel robotic imitation learning framework that comprises an image quality assessment module, a confidence-based learning mechanism, and a decision-making module, which is designed to maintain high performance even under distorted image observations. In the proposed framework, the image quality assessment module synergizes with the confidence-based learning mechanism to enhance the efficacy of the decision-making module. Specifically, the image quality assessment module is incorporated to extract image quality information from image observations, while the confidence-based learning mechanism adaptively prioritizes challenging samples to improve learning effectiveness. The decision-making module determines appropriate discrete skills or continuous actions. Experimental results demonstrate that our formulated framework enhances the overall performance of the decision-making module.
ConsistencyPlanner: Real-time Planning with Fast-Sampling Consistency Models
Closed-loop planning in complex, real-world driving scenarios presents a critical challenge for autonomous driving systems. While traditional rule-based methods are interpretable, their predefined heuristics lack the adaptability for dynamic traffic environments. Learning-based approaches have shown considerable promise. Conversely, learning-based approaches, despite their promise, struggle to balance the modeling diverse and multimodal driving behaviors and real-time planning, often leading to indecisive or unsafe actions. To address this limitation, we propose Consistency Planner, a real-time planning framework with fast-sampling consistency models. Our approach is built upon two key technical contributions. Efficient Multimodal Sampling: We employ fast-sampling consistency models to generate a diverse set of plausible future trajectories. This enables efficient, real-time exploration of multimodal actions, overcoming the computational bottlenecks of previous iterative generative methods. Heterogeneous Feature Fusion: We introduce an attention-enhanced decoder that dynamically integrates heterogeneous input features (including scene feature and action token) into a cohesive representation for robust planning. Extensive evaluation in the Waymax simulator demonstrates superior performance in safety metrics compared to existing methods, with particularly strong results in challenging dynamic scenarios.
Cross-Modal Benchmarking for Robotic Perception in Natural Environments ICRA
Natural environments present a complex challenge to robotics perception systems. Current models, particularly vision foundation models, are largely trained on structured, urban environments leading to weaknesses in their perception for field robotics tasks. We showcase the limitations of current models using our recently released WildCross benchmark, a new cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF pose and synchronized dense lidar submaps. In this work, we provide an expanded analysis of the benchmark results from the recent WildCross benchmark, with particular emphasis on expanded metric depth estimation experiments. Access to the code repository and dataset for this work can be found at https://csiro-robotics.github.io/WildCross.
comment: Accepted to the IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026
Adversarial Attacks on Learned Policies for Surgical Robotic Tasks
Learning-based policies are being considered to augment the dexterity of human surgeons in robot-assisted surgery. Can the end-to-end mapping from visual observations to robot actions be vulnerable to adversarial attacks, potentially leading to patient injury? In this paper, we present the first study of adversarial threats to learning-based policies in surgical robotics. We investigate two threat modes: (a) disruptive attacks, where imperceptible visual perturbations interrupt policy execution, and (b) steering attacks, where such perturbations steer policy actions toward attacker-specified directions. We formulate three adversarial attack methods, each with increasing access to policy information, and evaluate their impact on two surgical subtasks: debridement and suturing. Our evaluation covers three end-to-end policy architectures: ACT, Diffusion Policy, and Pi0. In addition, we introduce a new class of photometric adversarial attacks that mimic natural visual changes, such as lighting variations, to generate effective yet visually plausible perturbations. Results from 560 physical experiments using phantoms for debridement and suturing suggest that state-of-the-art policies can be significantly disrupted, resulting in an average 61% reduction in surgical subtask success rates. Project page: https://sites.google.com/view/adversary-surgery
Learning Object Manipulation from Scratch via Contrastive Interaction
Contrastive Reinforcement Learning (CRL) has seen recent success in a wide variety of goal-conditioned robotics tasks by learning structured representations of the dynamics. However, despite its success in locomotion and simpler control domains, CRL often struggles in interaction-rich manipulation. We argue that a key source of this difficulty is object-centric interaction, such as contact or grasping, that induces distinct changes in the underlying dynamic modes. In this work, we formulate manipulation dynamics as a piecewise-smooth Markov process and show that interaction-induced mode changes create piecewise nonlinear reachability structures that are difficult for standard CRL energy functions to represent and plan over. Based on this analysis, we introduce Interaction-weighted Resampling (IWR). IWR performs interaction-aware resampling around phases before, during, and after interactions, encouraging the learned representation to preserve the mode boundaries that determine future reachability to capture multi-modal and piecewise nonlinear reachability. Across interaction-centric environments, including 2D dynamic control, robotic manipulation, and robot air hockey, IWR improves both sample efficiency and overall performance over prior CRL methods, with 19.8% average improvement in simulation. Finally, using a sim-to-real pipeline with policies trained by IWR, we demonstrate the first real-world goal-conditioned robot air hockey agent capable of hitting goals, improving success from 25% to 60%. Project Page: IWR-arxiv.github.io.
Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation
Explicit 3D representations are attractive for manipulation because they expose object shape, workspace geometry, and robot-object relations in metric coordinates. However, sparse 3D encoders are often learned through downstream task objectives, tying the representation to a particular data distribution, policy architecture, and action parameterization. We introduce Sparse2Act, an observation-action alignment framework for pretraining sparse point-cloud encoders. The key idea is to use task-space end-effector actions as geometric supervision: masked sparse 3D tokens are trained to organize scene features around the workspace motion paired with the observation. After pretraining, only the encoder initialization is reused by downstream policies, allowing them to retain their own architectures and action spaces, including joint-space commands. On the LIBERO-10 benchmark, our method achieves 86.9% average success after 500 fine-tuning steps. The same pretrained encoder supports LIBERO-to-Meta-World cross-domain transfer, achieving 73.4% average success on the Meta-World-5 benchmark. Ablations on the objective and decoder capacity show that the gains come from the masked action-alignment signal and remain useful across downstream action decoders. In real-world experiments, simulation pretraining followed by limited real-data fine-tuning achieves an average success rate of 72.5% across four tasks, demonstrating effective sim-to-real transfer. These results suggest that robot actions can provide compact geometric supervision for reusable sparse 3D representations.
EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows
Most learned dexterous grasp generators relegate contact forces to a downstream verification step, so a kinematically-plausible pose can still violate the conditions for a stable physical grasp. We address this with EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. Our architecture projects contacts onto the object surface and forces into the Coulomb friction cone by construction, so placement and friction compliance hold without loss penalties. We prove end-to-end SE(3) equivariance and verify it empirically over 200 rotations, with wrist residuals below $0.04^\circ$ and exactly zero joint deviation. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, our model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. We retarget decoded fingertip contacts to a 16-DoF LEAP Hand via per-finger inverse kinematics, and our hardware-feasible refinement places every joint at least 5% inside its actuator envelope while preserving wrench balance. On the physical robot, retargeted EquiDexFlow-decoded grasps complete open-loop pick-and-hold trials on all six test objects, with every asymmetric object succeeding at both the canonical pose and a $120^\circ$ co-rotation. Videos, code, and checkpoints are available at https://equidexflow.github.io.
comment: 22 pages, 11 figures, 11 tables. Project page with videos, code, and checkpoints: https://equidexflow.github.io
EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence
In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.
TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation MDM 2026
Human mobility data is important for transportation, urban planning, and epidemic control, but large-scale trajectory collection is often costly and privacy-constrained, motivating realistic synthetic trajectory generation. Existing LLM-based generators typically rely on either prompt engineering, which preserves zero-shot reasoning but lacks fine-grained spatiotemporal grounding, or trajectory-level fine-tuning, which improves statistical precision but incurs substantial computational cost and may weaken general reasoning. We propose TrajGenAgent, a semantic-aware hierarchical LLM-agent framework for human mobility trajectory generation without model fine-tuning. TrajGenAgent uses a two-stage orchestrator-worker design: an LLM first synthesizes an individual- and weekday-conditioned activity chain from historical evidence via in-context learning, and a deterministic workflow then grounds each activity into a complete visit using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. To evaluate realism beyond aggregate spatiotemporal statistics, we introduce an anomaly-detection-based evaluation framework using two complementary detectors to assess behavioral and semantic plausibility. Experiments on benchmark and large-scale simulation datasets show that TrajGenAgent improves spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over representative neural and LLM-based baselines, while avoiding parameter updates.
comment: 14 pages, 2 figures, 8 tables. Accepted by the 27th IEEE International Conference on Mobile Data Management (MDM 2026)
Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning
Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.
comment: Accepted to the 23rd IFAC World Congress, 2026
DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems
Multi-agent systems are integral tools for various domains such as robotics, cybersecurity, and autonomous vehicle planning. These types of systems often have constraints on the computational resources, leading to a need for efficient lightweight algorithms. Traditional decision making frameworks often assume ideal conditions, such as full observability and unlimited computational capacity, which do not align with real-world challenges. In this paper, we introduce a new algorithm that allows for reduced demand on computational resources without a large cost of other performance metrics. Agents will limit their observability to some attention radius, which intentionally allows them to ignore parts of the environment that might be unnecessary for action planning. By optimizing both the attention radius and decision-making, our approach enhances coordination and scalability in uncertain environments. Through both theoretical analysis and empirical validation, we demonstrate the effectiveness of adaptive observation in improving system performance and maintaining robust decision-making strategies in resource-constrained systems.
EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations
Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable action. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video replacing human with robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: https://egoengine.github.io.
From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation
Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.
G-MAPP: GPU-accelerated Multi-Agent Planning and Perception for Reactive Motion Generation
Reactive motion generation in unstructured environments remains an open challenge in robotics. Due to the computational complexity of collision-free motion generation, existing methods either generate global trajectories for static scenarios, or employ models that make conservative assumptions about the environment. This paper identifies the primary bottleneck as the runtime performance demand of planning on high-fidelity environments, and the temporal integration between the perception and planning modules. Therefore, we propose a framework that does not compromise on runtime performance and world representations for perception and planning by accelerating world modeling and vector-field based planning using the GPU. This allows us to achieve faster parallel state exploration for quasi-global trajectory planning, and tighter coupling of the perception-action loop in real-time for dynamic cluttered environments with off-the-shelf depth sensors. We quantitatively evaluate the computation-time and success rate differences for the CPU and GPU versions of our planner, and perform qualitative evaluations of our coupled framework using real-world experiments on a 7-DoF Franka Emika robot. Experimental results demonstrate that our GPU-based framework achieves up to a 5x speedup over the CPU version and successfully avoids collisions across both trivial and challenging physical world scenarios.
comment: The implementation is available at: https://github.com/chart-research/g-mapp
Foresight: Iterative Reasoning About Clues that Matter for Navigation
Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: https://amrl.cs.utexas.edu/foresight
comment: 22 pages, 10 figures, 3 tables
Action-Effect Memory Pretraining for Robot Manipulation
We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.
$μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models
Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as $μ$VLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, $μ$VLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found in https://avanturist322.github.io/mu-vla/.
comment: 34 pages, 20 figures, 9 tables
Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration
Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.
SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation
Robotic manipulation is typically evaluated by task success, but successful completion does not guarantee safe execution. Many safety failures are temporal: a robot may touch a clean surface after contamination or release an object before it is fully inside an enclosure. We introduce SafeManip, a property-driven benchmark to explicitly evaluate temporal safety properties in robotic manipulation, moving beyond prior evaluations that largely focus on task completion or per-state constraint violations. SafeManip defines reusable safety templates over finite executions using Linear Temporal Logic over finite traces (LTLf). It maps observed rollouts to symbolic predicate traces and evaluates them with LTLf-based monitors. Its property suite covers eight manipulation safety categories: collision and contact safety, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, and enclosure access. Templates can be instantiated with task-specific objects, fixtures, regions, or skills, allowing the same safety specifications to generalize across tasks and environments. We evaluate SafeManip on six vision-language-action policies, including $π_0$, $π_{0.5}$, GR00T, and their training variants, across 50 RoboCasa365 household tasks. Results show that even strong models often behave unsafely. Task-success gains do not reliably translate into safer execution: many successful rollouts remain unsafe, while longer-horizon or more complex tasks expose more violations. SafeManip provides a reusable evaluation layer for diagnosing temporal safety failures and measuring safe success beyond task completion.
Bimanual Robot Manipulation via Multi-Agent In-Context Learning
Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves 70.5% average success rate, outperforming the best training-free baseline by 6.1 percentage points and surpassing most supervised methods. We also demonstrate superior real-world performance on 3 tasks without hardware-specific retraining.
Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models ICML 2026
Vision-Language-Action (VLA) policies often fail under distribution shift, suggesting that decisions may depend on spurious visual correlations rather than task-relevant causes. We formulate visual-action attribution as an interventional estimation problem. Accordingly, we introduce the Interventional Significance Score (ISS), an interventional masking procedure for estimating the causal influence of visual regions on action predictions, and the Nuisance Mass Ratio (NMR), a scalar measure of attribution to task-irrelevant features. We analyze the statistical properties of ISS and show that it admits unbiased estimation, and we characterize conditions under which action prediction error provides a valid proxy for causal influence. Experiments across diverse manipulation tasks indicate that NMR predicts generalization behavior and that ISS yields more faithful explanations than existing interpretability methods. These results suggest that interventional attribution provides a simple diagnostic approach for identifying causal misalignment in embodied policies.
comment: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
SIL: Symbiotic Interactive Learning for Language-Conditioned Human-Agent Co-Adaptation
Today's autonomous agents, largely driven by foundation models (FMs), can understand natural language instructions and solve long-horizon tasks with human-like reasoning. However, current human-robot interaction frameworks largely follow a one-way master-apprentice technique where the embodied agent passively executes commands without reciprocal learning. This neglects the co-adaptive, multi-turn nature of everyday human-to-human interactions. We introduce symbiotic interactive learning (SIL), a bidirectional co-adaptation framework in a shared latent task space, where both the human and the agent maintain joint belief states that evolve with the interaction history. This enables proactive clarification, adaptive suggestions, and shared plan refinement. SIL leverages FMs for spatial perception and reasoning, together with a triplet-loss-trained neural encoder that grounds the FMs' outputs into task-specific latent representations. To support long-term stability as tasks evolve, SIL utilises episodic and semantic memory architectures, regularised via elastic weight consolidation to mitigate catastrophic forgetting. We evaluate SIL on simulated and real-world embodied tasks, including instruction following, information retrieval, query-oriented reasoning, and interactive dialogue, achieving a $90.4\%$ task completion rate and a belief alignment score of $ρ\approx 0.83$, an absolute improvement of about $20$ percentage points over the best ablations. Demos and resources: https://linusnep.github.io/SIL/.
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation
Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.
Adaptive Sliding Mode Control for Vehicle Platoons with State-Dependent Friction Uncertainty
Multi-robot formation control has various applications in domains such as vehicle troops, platoons, payload transportation, and surveillance. Maintaining formation in a vehicle platoon requires designing a suitable control scheme that can tackle external disturbances and uncertain system parameters while maintaining a predefined safe distance between the robots. A crucial challenge in this context is dealing with the unknown/uncertain friction forces between wheels and the ground, which vary with changes in road surface, wear in tires, and speed of the vehicle. Although state-of-the-art adaptive controllers can handle a priori bounded uncertainties, they struggle with accurately modeling and identifying frictional forces, which are often state-dependent and cannot be a priori bounded. This thesis proposes a new adaptive sliding mode controller for wheeled mobile robot-based vehicle platoons that can handle the unknown and complex behavior of frictional forces without prior knowledge of their parameters and structures. The controller uses the adaptive sliding mode control techniques to regulate the platoon's speed and maintain a predefined inter-robot distance, even in the presence of external disturbances and uncertain system parameters. This approach involves a two-stage process: first, the kinematic controller calculates the desired velocities based on the desired trajectory; and second, the dynamics model generates the commands to achieve the desired motion. By separating the kinematics and dynamics of the robot, this approach can simplify the control problem and allow for more efficient and robust control of the wheeled mobile robot.
Non-Equilibrium MAV-Capture-MAV via Time-Optimal Planning and Reinforcement Learning
The capture of flying MAVs (micro aerial vehicles) has garnered increasing research attention due to its intriguing challenges and promising applications. Despite recent advancements, a key limitation of existing work is that capture strategies are often relatively simple and constrained by platform performance. This paper addresses control strategies capable of capturing high-maneuverability targets. The unique challenge of achieving target capture under unstable conditions distinguishes this task from traditional pursuit-evasion and guidance problems. In this study, we transition from larger MAV platforms to a specially designed, compact capture MAV equipped with a custom launching device while maintaining high maneuverability. We explore both time-optimal planning (TOP) and reinforcement learning (RL) methods. Simulations demonstrate that TOP offers highly maneuverable and shorter trajectories, while RL excels in real-time adaptability and stability. Moreover, the RL method has been tested in real-world scenarios, successfully achieving target capture even in unstable states.
DynaRetarget: Dynamically-Feasible Retargeting using Sampling-Based Trajectory Optimization
In this paper, we introduce DynaRetarget, a complete pipeline for retargeting human motions to humanoid control policies. The core component of DynaRetarget is a novel Sampling-Based Trajectory Optimization (SBTO) framework that refines imperfect kinematic trajectories into dynamically feasible motions. SBTO incrementally advances the optimization horizon, enabling optimization over the entire trajectory for long-horizon tasks. We validate DynaRetarget by successfully retargeting hundreds of humanoid-object demonstrations and achieving higher success rates than the state of the art. The framework also generalizes across varying object properties, such as mass, size, and geometry, using the same tracking objective. This ability to robustly retarget diverse demonstrations opens the door to generating large-scale synthetic datasets of humanoid loco-manipulation trajectories, addressing a major bottleneck in real-world data collection.
Consensus-based optimization (CBO): Towards Global Optimality in Robotics
Zero-order optimization has recently received significant attention for designing optimal trajectories and policies for robotic systems. However, most existing methods (e.g., MPPI, CEM, and CMA-ES) are local in nature, as they rely on gradient estimation. In this paper, we introduce consensus-based optimization (CBO) to robotics, which is guaranteed to converge to a global optimum under mild assumptions. We provide theoretical analysis and illustrative examples that give intuition into the fundamental differences between CBO and existing methods. To demonstrate the scalability of CBO for robotics problems, we consider three challenging trajectory optimization scenarios: (1) a long-horizon problem for a simple system, (2) a dynamic balance problem for a highly underactuated system, and (3) a high-dimensional problem with only a terminal cost. Our results show that CBO is able to achieve lower costs with respect to existing methods on all three challenging settings. This opens a new framework to study global trajectory optimization in robotics.
ActionMap: Robot Policy Learning via Voxel Action Heatmap
Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: https://showlab.github.io/ActionMap/.
iPack: Intuitive Bin Packing with Large Language Models
Robotics and automation are increasingly influential in logistics but remain largely confined to traditional warehouses. In grocery retail, advancements such as cashier-less supermarkets exist, yet customers still manually pick and pack groceries. While there has been a substantial focus in robotics on the bin picking problem, the task of packing objects and groceries has remained largely untouched. However, packing grocery items in the right order is crucial for preventing product damage, e.g., heavy objects should not be placed on top of fragile ones. However, the exact criteria for the right packing order are hard to define, in particular given the huge variety of objects typically found in stores. In this paper, we introduce LLM-Pack, a novel approach for grocery packing. LLM-Pack leverages language and vision foundation models for identifying groceries and generating a packing sequence that mimics human packing strategy. LLM-Pack does not require dedicated training to handle new grocery items and its modularity allows easy upgrades of the underlying foundation models. We extensively evaluate our approach to demonstrate its performance. We will make the source code of LLMPack publicly available upon the publication of this manuscript.
comment: 7 Pages, 9 Figures
Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control. Website: https://msdp-pearl.github.io/
comment: 8 pages, 11 figures
Continual Quadruped Robots Coordination via Semantic Skill Discovery
Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.
comment: 22 pages, 8 figures, 11 tables. Project page: https://conquer-project.pages.dev/
Planar-Sector LOS Guidance for Interception of Agile Targets with Lifting-Wing Quadcopters ICRA 2026
Autonomous visual interception of agile aerial targets is challenging due to unpredictable target motion, limited sensing, and the strong coupling between target visibility and interceptor maneuverability. Most existing strapdown-camera interception methods preserve visibility using conic line-of-sight (LOS) constraints that keep the target near the image center. While safe, such symmetric constraints unnecessarily restrict maneuverability and can significantly reduce the usable thrust for pursuit. Motivated by the observation that aggressive FPV pilots do not maintain equal visibility margins in all image directions, this paper proposes a Planar-Sector Line-of-Sight (PS-LOS) guidance framework for autonomous interception using a lifting-wing quadcopter equipped with only a strapdown monocular camera. PS-LOS tightly constrains lateral image error while relaxing longitudinal image error within a safe field-of-view margin, preserving visibility while releasing maneuverability for acceleration-intensive pursuit. Under the lifting-wing quadcopter model, PS-LOS provides nearly 50% more available thrust near the LOS direction than conventional conic LOS constraints. To realize LOS-only interception without direct depth measurements, a delay-compensated state-estimation framework and a nonlinear guidance-and-control architecture are developed for lifting-wing quadcopters. Extensive outdoor flight experiments demonstrate autonomous interception of agile targets exhibiting large-amplitude, high-frequency, and unpredictable motion under real wind disturbances. The proposed system achieves successful interceptions at ranges up to 138 m while maintaining continuous visual tracking throughout the engagement. The results validate PS-LOS as a visibility-preserving, maneuverability-aware guidance framework for long-range visual interception of agile aerial targets.
comment: Accepted to the IEEE International Conference on Robotics and Automation (ICRA 2026). Recipient of the ICRA 2026 Best Paper Award in Field and Service Robotics
Learning Ordinal Response Policies in Rank-Based Stochastic Prize-Collecting Games
The Team Orienteering Problem (TOP) generalizes many real-world multi-agent scheduling and routing tasks that occur in autonomous mobility, aerial logistics, and surveillance applications. While many flavors of the TOP exist for planning in multi-agent systems, they assume that all the agents cooperate toward a single objective; therefore, they do not extend to settings when they compete in reward-scarce environments. We propose Stochastic Prize-Collecting Orienteering Games (SPCOG) as an extension of the TOP to plan in the presence of self-interested agents operating on a graph, under energy constraints and stochastic transitions. A theoretical discussion on complete and star graphs establishes that there is a unique pure Nash equilibrium in SPCOGs that coincides with the optimal routing solution of an equivalent TOP under rank-based conflict resolution. We propose the concept of Ordinal Rank (OR) as a concise representation of an agents' global rank and its location within a topological, well-defined neighborhood. Empirical evaluations conducted on real-world, road-network graphs under both dynamic and stationary prize distributions show that in parameter-sharing settings, the policies that leverage local information can outperform those policies leverage global information when the former is conditioned on the OR rather than the global rank, indicating that the OR acts as a strong inductive bias in multi-agent games on graphs. The OR-conditioned policies also generalize much better to games with large number of agents compared to global-rank conditioned policies. Finally, we also propose we propose Fictitious Ordinal Response Learning (FORL) as an entropy-regulated algorithm to obtain convergent policies in independent-learning settings in prize-collecting games on graphs.
Closing the Motion Execution Gap: From Semantic Motion Task Constraints to Kinematic Control IJCAI 2026
This paper addresses the Motion Execution Gap, the disconnect between high-level symbolic task descriptions using semantic constraints and executable robot motions. Motion Statecharts are introduced as an executable symbolic representation for complex motions. They allow the arbitrary arrangement of motion constraints, monitors or nested statecharts in parallel and sequence. World-centric motion specification and generalization across embodiments are enabled through the use of a unified differentiable kinematic world model of both, robots and environments. Motion execution is realized through a lMPC-based implementation of the task-function approach, in which smooth transitions during task switches are ensured using jerk bounds. Cross-platform transferability was demonstrated by deploying the method on eight robot platforms, operating in diverse environments. The proposed framework is called Giskard and is available open source: https://github.com/cram2/cognitive_robot_abstract_machine.
comment: 9 pages, 8 figures, to be published in IJCAI 2026
PIGEON: VLM-Driven Object Navigation via Points of Interest Selection
Object navigation in unseen indoor environments requires agents to perform semantic search under partial observability. Vision-language models (VLMs) provide strong semantic-spatial priors for this task, but how to interface them with robot navigation remains challenging: dense VLM inference is expensive, while abstracting environments into symbolic memories often separates high-level reasoning from the raw visual evidence that supports it. We propose we propose PIGEON (Point of Interest Guided Exploration for Object Navigation), a VLM-driven framework that formulates object navigation as raw-observation-grounded sparse decision problem. PIGEON introduces Points of Interest (PoIs) as sparse visual decision units that couple geometrically executable waypoints with raw egocentric observations. Rather than using VLMs as dense controllers or restricting them to frontier ranking, PIGEON enables VLMs to select among task-critical PoIs, including exploration frontiers, suspected target objects, traversable stairs, and floor-level summaries, while low-level planners execute continuous motion between them. This PoI interface further makes high-level navigation decisions verifiable, allowing us to develop an RLVR pipeline that improves local VLMs without manual Chain-of-Thought annotations. Extensive experiments on Habitat ObjectNav benchmarks show that PIGEON achieves state-of-the-art zero-shot performance, scales consistently with foundation model capacity, and transfers to Active Embodied Question Answering with only prompt modifications. Real-world deployments on physical robots further demonstrate its robustness and efficiency.
The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning
We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learning in robot manipulation. MiDiGap enables learning from as few as five demonstrations using only camera observations and generalizes across a wide range of challenging tasks. It excels at long-horizon behaviors such as making coffee, highly constrained motions such as opening doors, dynamic actions such as scooping with a spatula, and multimodal tasks such as hanging a mug. MiDiGap learns these tasks on a CPU in less than a minute and scales linearly to large datasets. We also develop a rich suite of tools for inference-time steering using evidence such as collision signals and robot kinematic constraints. This steering enables novel generalization capabilities, including obstacle avoidance and cross-embodiment policy transfer. MiDiGap achieves state-of-the-art performance on diverse few-shot manipulation benchmarks. On constrained RLBench tasks, it improves policy success by 76 percentage points and reduces trajectory cost by 67%. On multimodal tasks, it improves policy success by 48 percentage points and increases sample efficiency by a factor of 20. In cross-embodiment transfer, it more than doubles policy success. We make the code publicly available at https://midigap.cs.uni-freiburg.de.
comment: Submitted for publication to IEEE Transaction on Robotics
LEMON-Mapping: Loop-Enhanced Large-Scale Multi-Session Point Cloud Merging and Optimization for Globally Consistent Mapping
Multi-robot collaboration is becoming increasingly critical and presents significant challenges in modern robotics, especially for building a globally consistent, accurate map. Traditional multi-robot pose graph optimization (PGO) methods ensure basic global consistency but ignore the geometric structure of the map, and only use loop closures as constraints between pose nodes, leading to divergence and blurring in overlapping regions. To address this issue, we propose LEMON-Mapping, a loop-enhanced framework for large-scale, multi-session point cloud fusion and optimization. We re-examine the role of loops for multi-robot mapping and introduce three key innovations. First, we develop a robust loop processing mechanism that rejects outliers and a loop recall strategy to recover mistakenly removed but valid loops. Second, we introduce spatial bundle adjustment for multi-robot maps, reducing divergence and eliminating blurring in overlaps. Third, we design a PGO-based approach that leverages refined bundle adjustment constraints to propagate local accuracy to the entire map. We validate LEMON-Mapping on several public datasets and a self-collected dataset. The experimental results show superior mapping accuracy and global consistency of our framework compared to traditional merging methods. Scalability experiments also demonstrate its strong capability to handle scenarios involving numerous robots.
CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision
Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods learn measurement weighting through the solver, but they still use position-only objectives. As a result, the position estimate may improve while the reported covariance remains too small, too large, or incorrectly oriented. We propose CredibleDFGO (CDFGO), a differentiable GNSS factor graph framework that makes covariance credibility an explicit training target. A Weighting Generation Network (WGN) predicts per-satellite reliability weights, and a differentiable Gauss-Newton solver maps these weights to a position estimate and a Hessian-derived posterior covariance. We use proper scoring rules to supervise the East-North predictive distribution end to end. We study negative log-likelihood (NLL), the energy score (ES), and their combination. Results on three UrbanNav test scenes show consistent gains in covariance credibility. Positioning accuracy also improves on the medium-urban and harsh-urban scenes; on the deep-urban scene, both the mean horizontal error and the 95th-percentile error improve. On the harsh-urban Mong Kok (MK) scene, CDFGO-Combined reduces the mean horizontal error from 13.77 m to 11.68 m, reduces NLL from 40.63 to 6.59, and reduces ES from 12.31 to 9.05 relative to DFGO (MAE). Case studies link the MK improvement to better axis-wise consistency, more credible local covariance ellipses, and satellite-level reweighting.
comment: Submitted to NAVIGATION: Journal of the Institute of Navigation
EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations
Vision-based Unmanned Aerial Vehicles (UAVs) frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is the accurate estimation of the distance between camera and target object under real-world conditions, achieved by fusing multiple image modalities. As part of the system for automatic people detection and face recognition using deep learning, in this paper we present the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. Deep learning based filtering of depth camera data and estimation of camera-to-body distance from a monocular camera are achieved with YOLO-pose, enabling real-time fusion of depth information using the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints, to maintain the safe distance between the drone and the human target. Our system provides an accurate estimated distance, which has been validated against motion capture ground truth data. The system has been tested in real time indoors, where it reduces the average errors, RMSE and standard deviations of distance estimation up to 15,3% in three tested scenarios. Based on the test results, the EKF fusion-based approach increases the depth detection range by reducing the errors outside the optimal depth camera working range. It also shows improved robustness and precision in challenging conditions, such as reflections and poor visibility, making it suitable for SAR.
comment: This work has been submitted to the IEEE for possible publication
CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents
Current navigation benchmarks focus on task success but do not capture the economic constraints essential for commercializing autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents on a cost-revenue and break-even analysis, pairing Isaac Sim's collision and cargo dynamics with industry-standard data such as Securities and Exchange Commission (SEC) filings and Abbreviated Injury Scale (AIS) injury reports. To our knowledge, CostNav is the first physics-grounded economic benchmark to use regulatory and financial data to quantify the gap between navigation metrics and commercial deployment, revealing that high task-success rates alone do not ensure economic viability. Evaluating seven baselines (two rule-based and five imitation-learning methods), we find no method economically viable: all yield negative contribution margins. CANVAS, using only an RGB camera and GPS, attains the highest task success and the least-negative margin among methods with non-zero Service-Level Agreement (SLA) compliance (-\$28.40/run), outperforming LiDAR-equipped Nav2 w/ GPS (-\$37.34/run). A sim-trained policy evaluated on a real delivery robot yields SLA compliance close to its simulation result, indicating that policy performance in CostNav's simulation transfers to real-world deployment. We challenge the community to achieve economic viability on CostNav, which scores methods by cost-revenue outcomes. All resources are available at https://github.com/worv-ai/CostNav.
Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination
World-Action Models (WAMs) have emerged as a promising paradigm for embodied control by coupling future visual prediction with action generation. However, most existing WAMs rely on photorealistic future prediction, which incurs high inference latency and makes real-time robot deployment difficult. This motivates a more efficient WAM design that preserves the control benefits of future visual prediction while reducing its inference cost. We introduce Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit. Efficient-WAM improves inference efficiency via a compact video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Instead of optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, our 1B-parameter model can reduce per-chunk latency to around 100 ms during physical deployment, achieving a 30x speedup over existing WAMs.
Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution
Embodied Agents are evolving from passive reasoning systems into active executors that interact with tools, robots, and physical environments. Once an agent gains execution authority, the central challenge shifts from how to make it act to how to keep its actions governable at runtime. Existing approaches embed safety, recovery, and decision constraints inside the agent loop, making execution control difficult to standardize, audit, and adapt across environments. We propose a runtime governance framework for policy-constrained execution that separates agent cognition from execution oversight. Governance is externalized into a dedicated runtime layer performing policy checking, capability admission, execution monitoring, rollback, and human override. We formalize the control boundary among a persistent Embodied Agent, modular Capability Packages, and the governance layer, and define a policy-constrained execution pipeline evaluated under controlled simulation. Over 1000 randomized trials, the framework achieves 96.2%+/-2.7% interception of unauthorized actions, reduces unsafe continuation from 100% to 22.2%+/-3.1% under runtime drift, and attains 90.7%+/-3.0% recovery success with full policy compliance. Comparison with five baselines, including AutoRT-style constitution filtering and RoboGuard-style two-stage guardrails, shows that pre-execution filtering is equally effective across governance-aware methods, while only the proposed framework provides continuous runtime detection (RVDR = 61.3% vs. 0%) and structured recovery (all p<0.001). A sensitivity sweep across the full detection range confirms a genuine detection-continuation trade-off. This work argues future embodied systems should be designed for governable execution.
comment: 36 pages, 3 figures, 10 tables
TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation
Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines.
Phase-Based Multi-Gait Learning for a Salamander-Like Robot
Salamander-like robots are designed inspired by the skeletal structure of their biological counterparts. However, existing controllers cannot fully exploit these morphological features and largely rely on predefined patterns or joint trajectories, which prevents the generation of diverse and flexible gaits and limits their applicability in real-world scenarios. In this paper, we propose a phase-based learning framework that enables the robot to acquire a diverse repertoire of gaits without using reference motions. Each body part is controlled by a phase variable capable of forward and backward evolution, with a phase coverage reward to promote the exploration of the leg phase space. Additionally, morphological symmetry of the robot is incorporated via data augmentation, improving sample efficiency and enforcing both motion-level and task-level symmetry in learned behaviors. Extensive experiments show that the robot successfully acquires 22 representative gaits exhibiting both dynamic and symmetric movements, demonstrating the effectiveness of the proposed learning framework.
SR-LIO++: LiDAR-Inertial Odometry and Quantized Mapping with Caching-Aware Sweep Reconstruction
Addressing the inherent low acquisition frequency limitation of 3D LiDAR to achieve high-frequency output has become a critical research focus in the LiDAR-Inertial Odometry (LIO) domain. To ensure real-time performance, frequency-enhanced LIO systems must process each sweep within significantly reduced timeframe, which presents substantial challenges for deployment on resource-constrained platforms. To address these limitations, we introduce SR-LIO++, an innovative LIO system capable of achieving doubled output frequency relative to input frequency on resource-constrained hardware platforms, including the Raspberry Pi 4B. Our system employs the previously proposed sweep reconstruction methodology to enhance LiDAR sweep frequency, generating high-frequency reconstructed sweeps. Building upon this foundation, we propose a caching mechanism for intermediate results (i.e., surface parameters) of the most recent segments, effectively minimizing redundant processing of common segments in adjacent reconstructed sweeps. This method decouples processing time from the traditionally linear dependence on reconstructed sweep frequency. Furthermore, we present a quantized map point management based on index table mapping, significantly reducing memory usage by converting global 3D point storage from 64-bit double precision to 8-bit char representation. This method also converts the computationally intensive Euclidean distance calculations in nearest neighbor searches from 64-bit double precision to 16-bit short and 32-bit integer formats, reducing computational cost. Extensive experimental evaluations across three distinct computing platforms and four public datasets demonstrate that SR-LIO++ maintains state-of-the-art accuracy while substantially enhancing efficiency. Notably, our system successfully achieves 20 Hz state output on Raspberry Pi 4B hardware.
comment: 18 pages, 10 figures
RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning
Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold and progressively shifts optimization towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo demonstrates free-kick shot error 48.6% lower and shoot velocity 2.96x than prior work baselines. In real world on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in free-kick and moving-ball cases, accordingly. And the post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: https://opendrivelab.com/RoboNaldo.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagate policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can fine-tune poorly-initialized behavior cloning policies to near full task-success with no expert data in the online replay buffer, and does so with few task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms methods alternatives on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.
CU-Multi: A Dataset for Multi-Robot Collaborative Perception
A central challenge for multi-robot systems is fusing independently gathered perception data into a unified representation. Despite progress in Collaborative SLAM (C-SLAM), benchmarking remains hindered by the scarcity of dedicated multi-robot datasets. Many evaluations instead partition single-robot trajectories, a practice that may only partially reflect true multi-robot operations and, more critically, lacks standardization, leading to results that are difficult to interpret or compare across studies. While several multi-robot datasets have recently been introduced, they mostly contain short trajectories with limited inter-robot overlap and sparse intra-robot loop closures. To overcome these limitations, we introduce CU-Multi, a dataset collected over multiple days at two large outdoor sites on the University of Colorado Boulder campus. CU-Multi comprises four synchronized runs with aligned start times and controlled trajectory overlap, replicating the distinct perspectives of a robot team. It includes RGB-D sensing, RTK GPS, semantic LiDAR, and refined ground-truth odometry. By combining overlap variation with dense semantic annotations, CU-Multi provides a strong foundation for reproducible evaluation in multi-robot collaborative perception tasks.
comment: 8 pages, 11 figures. arXiv admin note: text overlap with arXiv:2505.17576
Triangle Splatting SLAM
We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.
comment: 26 pages, 11 figures
DiskChunGS: Large-Scale 3D Gaussian SLAM Through Chunk-Based Memory Management
Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive results for novel view synthesis with real-time rendering capabilities. However, integrating 3DGS with SLAM systems faces a fundamental scalability limitation: methods are constrained by GPU memory capacity, restricting reconstruction to small-scale environments. We present DiskChunGS, a scalable 3DGS SLAM system that overcomes this bottleneck through an out-of-core approach that partitions scenes into spatial chunks and maintains only active regions in GPU memory while storing inactive areas on disk. Our architecture integrates seamlessly with existing SLAM frameworks for pose estimation and loop closure, enabling globally consistent reconstruction at scale. We validate DiskChunGS on indoor scenes (Replica, TUM-RGBD), urban driving scenarios (KITTI), and resource-constrained Nvidia Jetson platforms. Our method uniquely completes all 11 KITTI sequences without memory failures while achieving superior visual quality, demonstrating that algorithmic innovation can overcome the memory constraints that have limited previous 3DGS SLAM methods.
From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning
Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pretraining on offline videos and post-training through reinforcement learning. It maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations: (1) an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and (2) a Residual-Attention Module for reinforcement learning, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models.
comment: 27 pages, 20 figures, 9 tables, conference
ReactEMG Stroke: Healthy-to-Stroke Few-shot Adaptation for sEMG-Based Intent Detection
Surface electromyography (sEMG) is a promising control signal for assist-as-needed hand rehabilitation after stroke, but detecting intent from paretic muscles often requires lengthy, subject-specific calibration and remains brittle to variability. We propose a healthy-to-stroke adaptation pipeline that initializes an intent detector from a model pretrained on large-scale able-bodied sEMG, then fine-tunes it for each stroke participant using only a small amount of subject-specific data. Using a newly collected dataset from three individuals with chronic stroke, we compare adaptation strategies (head-only tuning, parameter-efficient LoRA adapters, and full end-to-end fine-tuning) and evaluate on held-out test sets that include realistic distribution shifts such as within-session drift, posture changes, and armband repositioning. Across conditions, healthy-pretrained adaptation consistently improves stroke intent detection relative to both zero-shot transfer and stroke-only training under the same data budget; the best adaptation methods improve average transition accuracy from 0.42 to 0.61 and raw accuracy from 0.69 to 0.78. These results suggest that transferring a reusable healthy-domain EMG representation can reduce calibration burden while improving robustness for real-time post-stroke intent detection. Our project website, video, code, and dataset are available at: https://roamlab.github.io/reactemg-stroke/.
Multiagent Systems
CCKS: Consensus-based Communication and Knowledge Sharing
In Decentralized Training and Decentralized Execution (DTDE) for cooperative Multi-Agent Reinforcement Learning (MARL), action-advising-based knowledge sharing promotes interpretable and scalable cooperation among agents. However, current action advising approaches often adhere too much to the teacher's guidance without evaluating teacher-student compatibility, which causes excessive advising, suboptimal stability, and degraded performance. To overcome these challenges, this paper presents a Consensus-based Communication and Knowledge Sharing (CCKS) framework, which allows agents to adopt recommendations based on consensus-derived constraints and to follow the teacher's instructions more smartly. This mechanism enables agents to balance exploration and learning from experienced teachers, improving overall performance. The key is the consensus model construction, for which we propose to employ contrastive learning to construct consensus models based on local observations in the agents' training phase. In action selection, agents score and choose actions based on consensus and shared knowledge. Designed as a plug-and-play solution, CCKS integrates seamlessly with existing DTDE algorithms. Experiments conducted in the Google Research Football environment and the complex StarCraft II Multi-Agent Challenge demonstrate that the integration with CCKS significantly improves cooperation efficiency, learning speed, and overall performance compared with current DTDE baselines. The code is available at https://github.com/yuanxpy/CCKS.
Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework
Automating compliance check for geometry-intensive regulations remains a significant technical bottleneck in Building Information Modeling (BIM), primarily due to the semantic disparity between high-level regulatory logic and structured IFC data. Existing methods, often reliant on static rule templates, struggle to traverse multi-hop reasoning chains or resolve latent spatial dependencies across multiple building entities. To address these challenges, a Spatial-Geometric Reasoning System for Building Information Modeling (SGR-BIM) is proposed as an integrative graph-driven reasoning framework. SGR-BIM dynamically constructs a cross-modal knowledge graph that aligns user intent, regulatory semantics, and BIM geometry, enabling interpretable reasoning without rigid hard-coding. Validated on 679 expert-verified queries from fire safety codes, the framework achieves 84.3% accuracy, representing an 8.6% improvement over enhanced-tool single-agent baselines. This research provides a graph-based semantic reasoning paradigm, enhancing the transparency and flexibility of automated geometric compliance check workflows in the Architecture, Engineering, and Construction (AEC) industry.
Evaluation of Alternative-Based Information Systems for Deliberative Polling using an Agentic Simulator
Deliberative polling promises to improve collective decision-making by exposing shareholders to a broad range of arguments before they vote. Yet ensuring that every voter encounters a representative sample of the reason space, the coverage problem, remains an open challenge, particularly at scale and in adversarial or strategically motivated electorates. This paper introduces a way of evaluating solutions using the LLM-based Agentic Bipolar Argumentation Simulator, grounded in a framework which formalises a poll as a six-tuple of endorsing and opposing justifications, attack and enhance relations, and shareholder- and relation-weights. ABAS simulates N autonomous shareholder agents, each assigned a latent opinion according to desired distributions in [-1, 1], who sequentially vote, choose or author justifications, and optionally submit argumentation-graph links. The simulator implements recommendations that rank existing justifications by their observable endorsement mass. It evaluates the mechanism's success by coverage, namely the fraction of the corpus reason-tag set represented in the K recommendations presented to each shareholder, as a solution to the NP-hard Subsuming Justification Problem. Reported experiments characterise how creativity rate (pown), recommendation size (K), argumentation density (plinks), and population size (N) affect coverage and corpus diversity. In an authenticated electorate where Sybil attacks are impossible and only the relation graph is gameable, we stress-test the scoring with coordinated strategic voting attacks: a tag-flood attack collapses coverage, while author-count relation weighting through a reversed-PageRank rule resists the flood markedly better than uniform weights.
Sovereign Assurance Boundary: Certificate-Bound Admission for Agentic Infrastructure
Agentic infrastructure introduces a critical control-plane authorization problem: non-deterministic reasoning systems can propose high-stakes mutations to production resources, yet existing security mechanisms -- such as identity and access management (IAM), policy engines, consensus protocols, and audit logs -- either enforce static, context-unaware permissions or merely record actions post-execution. This paper introduces the Sovereign Assurance Boundary (SAB), a certificate-bound runtime admission layer for autonomous execution authority. SAB intercepts agent proposals at an assurance airlock, compiles them into typed execution contracts $C$, and binds these contracts to cryptographic evidence digests $H(E)$ and policy versions. The contracts are then routed through consequence-aware certification paths. Upon successful admission, the system emits a signed Sovereign Assurance Certificate ($Ω$) that is strictly scoped to a specific execution identity, revocation epoch, and validity window. Finally, a sovereign execution broker verifies $Ω$ and performs fresh pre-execution revocation and drift checks before invoking infrastructure APIs. We detail the airlock-broker architecture, formalize its admission and revocation invariants, and report preliminary feasibility measurements from a Go prototype evaluated over 2,500 admission attempts. Ultimately, this broker-enforced model prevents autonomous reasoning from directly mutating state, transforming delegated execution authority into a cryptographically verifiable, evidence-bound, revocable, and replayable runtime artifact.
comment: 12 pages, 1 figure, 13 tables
Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows ICML 2026
As LLM-based multi-agent systems (MAS) are deployed in the wild, the resilience of their collaboration structures against adversarial compromise becomes a critical safety concern. Attackers may leverage prompt-injection or jailbreaking to sabotage individual agents within MAS workflows, but the interaction between model scaling and system-level resilience remains poorly understood. This paper investigates how model scale affects the security of linear multi-agent workflows. Our experiments across scales of two open-weight model families on the HumanEval benchmark reveal a compliance-correction symmetry: larger models are far more likely to faithfully execute malicious instructions, with the control-to-malicious performance drop reaching 53.7pp at 27B in uncorrected pipelines. However, appending a lightweight terminal Fixer stage collapses this to 0.6pp and restores statistical parity with control-level performance, demonstrating that strictly linear collaboration structures can be viable and resilient to adversaries at this scale, and suggesting that the brittleness previously attributed to linear topology may stem from a lack of correction.
comment: 16 pages (4 are main text), 2 figures, 6 tables. Accepted to the AIWILD Workshop at ICML 2026
SAIGuard: Communication-State Simulation for Proactive Defense of LLM Multi-Agent Systems
LLM-based multi-agent systems (MAS) solve complex tasks through inter-agent collaboration, but their communication-driven nature also allows security risks to spread across agents and trigger system-wide failures. Existing MAS defenses mainly follow a reactive paradigm after execution by detecting and isolating harmful agents, which may cause irreversible damage and degrade collaborative utility. To address this, we propose a proactive defense framework for MAS security, namely a Simulation-aware Interception Guard (SAIGuard). SAIGuard performs communication-state simulation over the MAS interaction graph, estimates the impact of incoming messages on local agent states and the global MAS state, and detects risky messages via reconstruction deviations from benign communication patterns. Instead of isolating agents, SAIGuard sanitizes or regenerates suspicious messages before it propagation into system. Experiments across diverse topologies and attack scenarios show that SAIGuard reduces attack success rates while maintaining MAS utility, outperforming reactive defenses.
WorkBench Revisited: Workplace Agents Two Years On
The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage. Second, while several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. Third, the rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models, while frontier costs have stayed relatively stable. We release an updated version of the benchmark with data and code quality improvements, new model scores, and analysis of agent progress on WorkBench since 2024.
comment: 8 pages, 3 figures. Follow-up to arXiv:2405.00823
Bimanual Robot Manipulation via Multi-Agent In-Context Learning
Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves 70.5% average success rate, outperforming the best training-free baseline by 6.1 percentage points and surpassing most supervised methods. We also demonstrate superior real-world performance on 3 tasks without hardware-specific retraining.
Multidimensional Manhattan Preferences
A preference profile (i.e., a collection of linear preference orders of the voters over a set of alternatives) with $m$ alternatives and $n$ voters is $d$-Manhattan (resp. $d$-Euclidean) if both the alternatives and the voters can be placed into a $d$-dimensional space such that between each pair of alternatives, every voter prefers the one which has a shorter Manhattan (resp. Euclidean) distance to the voter. We study how $d$-Manhattan preference profiles depend on the values $m$ and $n$. First, we provide explicit constructions to show that each preference profile with $m$ alternatives and $n$ voters is $d$-Manhattan whenever $d \ge \min(n, m - 1)$. We further extend this positive result for other $p$-norms with $p \in R_{\ge 1} \cup \{\infty\}$. Second, for $d = 2$, we develop forbidden substructures-preference patterns among small sets of voters that constrain any 2-Manhattan embedding -- and use them to show that the smallest non-2-Manhattan preference profile has either 3 voters and 6 alternatives, or 4 voters and 5 alternatives, or 5 voters and 4 alternatives. This is more complex than the case with $d$-Euclidean preferences (see (Bogomolnaia and Laslier, 2007) and (Bulteau and Chen, 2022)). We also show that $d$-Manhattan preferences imply $(2d-1)$-dimensional single-peakedness, while 2-Manhattanness is incomparable with single-peakedness and single-crossingness.
Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning
Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.
Collective decision-making with higher-order interactions on $d$-uniform hypergraphs
Understanding how group interactions influence opinion dynamics is fundamental to the study of collective behavior. In this work, we propose and study a model of opinion dynamics on $d$-uniform hypergraphs, where individuals interact through group-based (higher-order) structures rather than simple pairwise connections. Each one of the two opinions $A$ and $B$ is characterized by a quality, $Q_A$ and $Q_B$, and agents update their opinions according to a general mechanism that takes into account the weighted fraction of agents supporting either opinion and the pooling error, $α$, a proxy for the information lost during the interaction. Through bifurcation analysis of the mean-field model, we identify two critical thresholds, $α_{\text{crit}}^{(1)}$ and $α_{\text{crit}}^{(2)}$, which delimit stability regimes for the consensus states. These analytical predictions are validated through extensive agent-based simulations on both random and scale-free hypergraphs. Moreover, the analytical framework demonstrates that the bifurcation structure and critical thresholds are independent of the underlying topology of the higher-order network, depending solely on the parameters $d$, i.e., the size of the interaction groups, and the quality ratio. Finally, we bring to the fore a nontrivial effect: the large sizes of the interaction groups, could drive the system toward the adoption of the worst option.
Continual Quadruped Robots Coordination via Semantic Skill Discovery
Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.
comment: 22 pages, 8 figures, 11 tables. Project page: https://conquer-project.pages.dev/
SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement
Skill documents, structured natural-language instructions that guide Large Language Model (LLM) agents, are critical to modern agent frameworks, yet LLMs struggle to write skills that actually work. On SkillsBench, human-authored skills improve pass rates by 16.2 percentage points, while LLM-authored skills provide no measurable gain. We introduce SkillAxe, a fully unsupervised framework that enables LLMs to iteratively diagnose and refine their own skills. SkillAxe decomposes skill quality into four interpretable dimensions (quality impact, trigger precision, instruction compliance with fault attribution, and solution-path coverage), producing structured improvement briefs that require no ground-truth labels, test suites, or environment rewards. On SkillsBench, SkillAxe improves pass rates by 28\% relative over unimproved LLM skills and closes 47--67\% of the gap to human-authored skills. We validate the approach as a continuous improvement engine in the wild on SpreadsheetBench, where a SkillAxe-built skill library learns from past agent trajectories and raises pass rate from 16.0\% to 52.0\% using only 22 skills.
comment: 9 pages, under review
Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL
We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existing algorithms that rely mainly on conservatism in policy design, DOM2 enhances policy expressiveness and diversity based on diffusion model. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-reweighting scheme in training. These key ingredients significantly improve algorithm robustness against environment changes and achieve significant improvements in performance, generalization and data-efficiency. Our extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in all multi-agent particle and multi-agent MuJoCo environments, and generalizes significantly better to shifted environments {(in $28$ out of $30$ settings evaluated)} thanks to its high expressiveness and diversity. Moreover, DOM2 is ultra data efficient and requires no more than $5\%$ data for achieving the same performance compared to existing algorithms (a $20\times$ improvement in data efficiency).
Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning ICAPS 2026
Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower's optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient of the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader's decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method's effectiveness in both discrete and continuous state tasks.
comment: 29 pages. Extended version of the paper accepted to ICAPS 2026
MARIC: Multi-Agent Reasoning for Image Classification
Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine tuning to achieve competitive performance. While recent vision language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single pass representations, often failing to capture complementary aspects of visual content. In this paper, we introduce Multi Agent based Reasoning for Image Classification (MARIC), a multi agent framework that reformulates image classification as a collaborative reasoning process. MARIC first utilizes an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter-heavy training and monolithic VLM reasoning. Experiments on 4 diverse image classification benchmark datasets demonstrate that MARIC significantly outperforms baselines, highlighting the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.
comment: 11 pages, preprint
Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents
Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-stakes domains. Unlike statistical anomalies, events are often defined by natural-language descriptions with internal temporal-logic structures across multiple physical channels. However, in real-world settings, dense event annotations are expensive to obtain, making purely supervised learning difficult. We introduce Language-guided TSED, a setting where a model is given textual event descriptions and must ground them to intervals in multivariate signals with little or no labeled data. To address this problem, we propose Event Logic Tree (ELT), a knowledge representation framework that converts linguistic descriptions into structured temporal logic over signal primitives. Building on ELT, we present SELA, a neuro-symbolic VLM agent framework that iteratively grounds primitives from signal visualizations and composes them under ELT constraints, producing both event intervals and faithful tree-structured explanations. We further release a real-world benchmark across energy and climate domains with expert knowledge and annotations. Experiments show that SELA improves over supervised fine-tuning and existing zero/few-shot time series reasoning baselines.
comment: 8 pages (main text), 28 pages total including appendix. 9 figures, 7 tables
Systems and Control (EESS)
FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning
Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at https://jasonjzliu.com/factr2
comment: Website at https://jasonjzliu.com/factr2
Traceable Virtual Sea Trials in the Marine Robotics Unity Simulator for Manoeuvring Assessment of Unmanned Surface Vehicles
Accurate identification of hydrodynamic derivatives is essential for control and navigation of Unmanned Surface Vehicles (USVs), but high-fidelity manoeuvring data from physical sea trials are constrained by cost and safety. Turning Circle (TC) and Zig-Zag (ZZ) trials remain fundamental to IMO and ITTC assessment procedures. This paper extends the Marine Robotics Unity Simulator (MARUS) by introducing a standardised Virtual Sea Trial framework for automated execution and data generation of TC/ZZ manoeuvres, with traceable command-actuation logging, system-identification (SI)-focused data conditioning, and automated extraction of IMO/ITTC-aligned manoeuvring metrics. A key contribution is a dedicated TC/ZZ data acquisition and post-processing pipeline, improving the repeatability and auditability of simulator-based manoeuvres while producing SI-ready datasets for hydrodynamic-derivative identification and digital-twin workflows. Another feature is explicit command-execution separation for differential-thrust steering, where inputs are recorded as ordered rudder-equivalent commands and realised actuation is logged as an execution-level proxy derived from applied thrust. Case-study results demonstrate repeatable and compliant manoeuvre behaviour. For TC tests, the normalised advance differs by approximately 3.9 percent between port and starboard sides, while the tactical diameter differs by approximately 4.6 to 4.7 percent. For ZZ tests, first and second overshoot excesses remain below 1 degree for both +/- 10 degree and +/- 20 degree manoeuvres, satisfying IMO criteria, while peak yaw rates range from approximately 4.1 to 5.8 deg/s. Overall, the framework provides a repeatable and auditable virtual sea-trial workflow for generating IMO/ITTC-aligned datasets and supporting system identification, hydrodynamic-derivative estimation, and digital-twin calibration.
Analysis of a Distributed Optimization-Based Control Architecture for Inverter-Interfaced Virtual Power Plants
We develop a large-signal stability analysis for a sampled-data, optimization-based secondary controller for inverter-interfaced distributed energy resources in virtual power plants.
comment: 9 pages, no figures
From the Linear Quadratic Regulator (LQR) to the (Deterministic) Kalman Filter in Two Easy Steps
This note is a tutorial on the deterministic version of the Kalman filter (state estimator), which is formulated as finding the state trajectory consistent with the system's equations with the minimal amount of $L^2$ process and measurement uncertainty. As stated, this is an input signal design problem with linear dynamics and an objective that is affine-quadratic in the state and inputs. The first step is to convert this problem to one with a purely quadratic objective by embedding in a larger system using ``homogeneous coordinates''. This converts the problem to a purely quadratic (i.e. an LQR) problem, but with non-standard initial or final state constraints. This latter problem can then be solved using a version of the matrix Differential Riccati Equation (DRE) for the larger LQR problem. The second step is a partitioning of this larger problem, which then yields the optimal dynamic observer and the DRE of the traditional Kalman filter. For comparison, the solution of the traditional LQ-tracking (Servomechanism) problem is also treated using a similar construction.
Lexicographic optimization for real-time CNC feedrate planning with coupled orientation handling
Optimization-based feedrate planning offers the potential to significantly increase machining productivity, but its industrial adoption has been limited by high computational cost and extensive tuning effort. This paper proposes a lexicographic feedrate optimization principle that adaptively balances finishing time and motion smoothness in a tuning-free manner. To further improve computational efficiency, the optimization scheme is extended by a sparsity-exploiting formulation combined with a sequential windowing strategy, enabling real-time capable execution. In addition, a unified toolpath parameterization scheme is incorporated to synchronously handle tool position and orientation within the optimization framework. For a five-axis freeform test contour, the proposed method takes 14 s on an Intel i5-3470 CPU to optimize feedrate profiles for long toolpaths with 100,000 constraint checkpoints, and 52 s on a high-performance AMD 9950X CPU to handle one million checkpoints. Compared to an industrial CNC kernel, the resulting finishing time is reduced by more than 15 %.
On the Dynamics and State Dependent Multiple Equilibria of a Post-Buckled Ultra-Flexible Inverted Pendulum on a Rotating Hub
Compliant element systems with ultra-large deformation display rich nonlinear dynamics and pose challenging control problems, which, when solved, could enable enhancements in several mechatronics applications, such as soft robotics, MEMS, and biomedical applications. This paper considers post-buckled dynamic analysis of an inverted ultra-flexible pendulum actuated by a rotary hub. We first derive a complete set of equations capturing the dynamics of the system, essential for control development, using the assumed modes method framework, considering ultra-large deformations. Constrained Lagrange formulation is used for the same. In the perfect inverted configuration with zero hub angle, the buckled beam would display two symmetric stable equilibria and one unstable. However, as the hub angle changes on either side, the equilibrium positions shift, and eventually two of them vanish, and we are left with only one stable equilibrium. We use the dynamic equations to characterize this interesting phenomenon, demonstrating the continuous state dependence of multiple equilibria. Furthermore, experimental counterparts of the equilibrium results are meticulously obtained and discussed. Moreover, simulation results capture the nonlinear dynamics of this system. Overall, the work establishes a solid mathematical foundation with a control-amenable model for futuristic ultra-compliant mechatronic systems.
comment: 9 pages, 8 figures, Appendix
Zero Knowledge Verification of Transaction Guides for P2P Energy Trading in Distribution Networks
Peer-to-peer (P2P) energy trading requires network-aware coordination because transactions are physically realized through distribution networks. However, sensitivity-based coordination causes a confidentiality-verifiability tradeoff, as network sensitivities may reveal vulnerable components while undisclosed sensitivities prevent participants from verifying utility-provided transaction guides. This paper proposes a zero-knowledge-proof-based method for verifying the computational integrity of network-constrained transaction guides with respect to committed private network data, without exposing network-sensitivity information. The guide defines admissible injection and withdrawal volumes derived from sign-decomposed sensitivity matrices while satisfying balance, voltage, line-flow, and optimality conditions. These conditions are encoded in an arithmetic circuit, represented as R1CS constraints and a quadratic arithmetic program, and verified using a bilinear pairing. Blockchain commitments bind the approved circuit, public inputs, statement identifiers, proof, and verification result for tamper-evident auditability. The proposed proof certifies correct guide computation from committed network data; the authenticity of the committed network data is handled through an explicit registration and attestation assumption. Case studies on a modified IEEE 33-bus system show satisfaction of network constraints after clearing, rejection of public-input and witness-inconsistency attacks, and practical on-chain overhead, with an 806-byte proof.
comment: 10 pages
Deep Reinforcement Learning for Adaptive Power Allocation in ISAC Systems with Mobile Target
In this paper, we study the power allocation for an integrated sensing and communication (ISAC) system which tracks a mobile target. We first model the problem as a Markov decision process, and then tackle it with a soft actor-critic (SAC) based deep reinforcement learning (DRL) approach. We also combine a Dirichlet policy, which naturally produces normalized continuous actions under random target motion. To exploit different features of sensing and communication operations, we carefully design a reward function such that the system can dynamically control power allocation to conserve resources. The simulation results demonstrate that the proposed scheme enhances tracking performance compared to other baselines while sustaining communication performance.
Physics-guided residual Kalman learning for state-of-charge estimation of lithium iron phosphate batteries
Accurate state of charge (SOC) estimation of lithium iron phosphate (LFP) batteries remains challenging because of their flat open-circuit-voltage (OCV)-SOC characteristics, temperature-dependent dynamics, and sensitivity to initialization errors. Here, we propose a physics-guided residual Kalman learning (PRKL) framework for electrochemical-model-based SOC estimation. PRKL combines a control-oriented single-particle-model-based extended Kalman filter (EKF), which provides recursive physical state propagation, with a gated recurrent unit (GRU) residual learner that compensates structured EKF errors using electrochemical states and measurement features. The framework is evaluated on a public graphite/LFP dataset covering three dynamic drive cycles, eight temperatures from -10 to 50 degrees C, and initialization offsets up to 20 percent. Using dynamic stress test (DST) and federal urban driving schedule (FUDS) cycles for training and the supplemental federal test procedure (US06) cycle for cross-profile testing within the same cell dataset, PRKL achieves a global average root mean square error (RMSE) of 1.19 percent, corresponding to a 77 percent reduction relative to the physics-only EKF. These results show that electrochemical state information can guide residual learning and improve recursive SOC estimation for LFP batteries. The present validation supports cross-profile robustness within the studied dataset and provides a basis for future cross-cell, ageing-aware, and embedded-platform validation.
comment: 36 pages, 4 figures. Author accepted manuscript. Accepted for publication in Journal of Energy Chemistry, published by Elsevier. Final version of record available at DOI: 10.1016/j.jechem.2026.05.040
Robust Zonotopic Control
We propose a zonotopic framework for synthesizing a single robust state feedback controller that is certified to stabilize every plant inside a matrix zonotope, describing linearly varying parameters or parametric uncertainty. Common robust design strategies rely on checking many vertex models or on complex gain-scheduling, leading to high offline computation and implementation complexity. Our approach finds a single gain that is provably valid across the entire parameter domain, which is simpler to implement and can reduce conservatism by exploiting the structure of the zonotope. We formulate the robust synthesis as a single convex program tailored to the zonotope representation and incorporate practical performance requirements (actuator constraints, decay rate, disturbance attenuation) into the same synthesis stage. In numerical experiments on a representative 4-state example, our controller provides larger stability coverage across the parameter domain, attains comparable transient performance and control effort to more complex designs, and significantly reduces the number and scale of offline synthesis problems required by other robust approaches, compared to common-vertex gain, $H_{\infty}$, and $μ$-synthesis baselines.
comment: accepted at CCTA 2026
Cooperative Switched Formation Control of Autonomous Vehicles: An Event-triggered Approach to Input Saturation and Time-delay Challenges
This paper presents a collaborative adaptive formation control framework for autonomous vehicles (AVs), that explicitly handles system uncertainties, input saturation, and communication delays. To overcome the inherent physical torque limits of steering and braking actuators, an input saturation compensation mechanism is introduced to render nonlinearities tractable and improve control reliability. Additionally, a delay-compensating auxiliary system is designed to mitigate the effects of communication delays and reduce tracking errors. Our framework incorporates a dynamic-threshold event-triggered control (ETC) strategy to optimize resource usage. Additionally, uncertainty observers and symmetric barrier Lyapunov functions are developed to ensure robust and safe formation maneuvers. Finally, the effectiveness of the proposed approach is validated through numerical simulations of vehicle formations, complemented by a 3D visualization video demonstrating the dynamic fleet reconfiguration process.
Robust Tuning of Model Predictive Control for MMC-Based High-Voltage Power Systems
High-voltage direct current (HDVC) transmission systems based on modular multilevel converters (MMCs) have become a key topology in modern power systems. The dynamics of MMCs exhibit strong multivariable coupling, constraints, and uncertainties, motivating the use of model predictive control (MPC) to enhance current regulation performance. However, MPC tuning is nontrivial and does not inherently guarantee stability or robustness, particularly in the presence of model uncertainties. This paper proposes a MPC tuning method that ensures robust performance under bounded model uncertainties. This method solves a convex linear optimization problem to compute the optimal weighting matrices Q, R, and P ensuring optimality and reproducibility. As a result, robustness is enhanced without increasing the online computation burden. The effectiveness of the method is validated through testing on a real-time digital simulator (RTDS) model of a point-to-point HVDC system. Results demonstrate improved performance compared to conventional LQR-based MPC tuning.
comment: Submitted to IEEE Transactions on Power Delivery
Risk-Aware AoII-Based Scheduling with Hybrid Transmission for a Semi-Markov Source
We consider a multi-receiver status update system in which a transmitter monitors a finite-state semi-Markov source and decides whether to stay idle, unicast an update, or broadcast a common update. We formulate a risk-aware scheduling problem that minimizes the long-term average sum of the average Age of Incorrect Information (AoII), average risk ratio, and transmission cost. The risk state is defined by whether the AoII exceeds a prescribed threshold. We solve the problem using model-based and model-free policies and compare them with two baselines. Numerical results show that the proposed policies outperform the baselines, exploit both unicast and broadcast transmissions, and capture the effect of the dwell-time law on scheduling performance.
CBF-based Driving Assistance for Traffic Flow Stabilization
This manuscript addresses a hierarchical control system designed to suppress traffic congestion. The lower-layered controllers, implemented in each controlled vehicle, monitor microscopic vehicle behaviors and assist human drivers to ensure sufficient spacing for following vehicles. This spacing logic is designed based on the Control Barrier Function. Meanwhile, the upper-layered controller monitors the macroscopic traffic flow and activates the necessary lower-layered controllers, using a data-driven approach for the activation logic design. Furthermore, the effectiveness of the proposed control system is evaluated in a traffic flow simulation environment constructed using real-world traffic data.
comment: 6 pages, 6 figures. Submitted to IFAC CPHS 2026
Koopman-based NMPC for Virtually Coupled Train Control System
This paper investigates an analytical Koopman-based nonlinear model predictive control (K-NMPC) approach for tracking control of virtually coupled train systems. A nonlinear train movement model incorporating train dynamics, speed and control input limits, passenger comfort constraints, and collision avoidance is systematically lifted into a finite-dimensional Koopman space through closed-form observable functions. After freezing the affine parameter-varying lifted predictor along the shifted predicted trajectory, the online optimal control problem is solved as a quadratic program that can be solved efficiently. The proposed KNMPC is benchmarked against a time-discrete NMPC scheme, demonstrating comparable control performance with significantly reduced online computation time and strong potential for real-time implementation in practical virtually coupled train control systems.
comment: to be presented at IFAC World Congress 2026
Comparative Evaluation of Transition Mechanisms for Adaptive Droop Gains in Parallel Grid-Forming Inverters
Uncertainty in standalone microgrid operation usually originates from mismatches between power references and forecasts. These deviations are compensated by grid-forming controlled units, which distribute the required power contribution based on their droop gains. To introduce an additional degree of flexibility, it is possible to treat droop gains as decision variables to redistribute active-power contributions according to system-level objectives. However, directly applying updated droop gain references from a supervisory layer to the primary controllers can introduce power and frequency transients. This paper investigates transition mechanisms for applying scheduled active-power droop gain changes during operation. Hard switching, rate-limited transition, first-order IIR low-pass filtering, and cubic as well as quintic S-curve transitions are compared experimentally on two parallel 15 kW grid-forming inverter units. The results show that shaping the droop gain trajectory significantly reduces transient deviations compared to hard switching. In the considered case study, the S-curve transitions provide the strongest transient mitigation, reducing the active-power overshoot from 632.7 W to approximately 115 W and limiting the frequency overshoot to about 0.003 Hz.
Violation-Informed Spatio-Temporal Adaptive Targeting Framework for EV-Driven Distribution System Expansion Planning
The rapid adoption of electric vehicles (EVs) can cause severe voltage drops and line current overloads in distribution networks, creating an urgent need for scalable expansion planning methods. This paper proposes a computationally efficient violation-informed spatio-temporal adaptive targeting (STAT) framework for EV-driven distribution system expansion planning. The framework first identifies potential voltage and current violations through a violation analysis model, and then mitigates them through a joint optimal expansion planning model that co-optimizes investment decisions for line reconductoring, shunt capacitors, and battery energy storage systems. To reduce computational burden, the proposed STAT-temporal criticality assessment (STAT-TCA) method extracts primitive stress events from annual operating data, derives an initial set of candidate planning horizons from signature-consistent segments, and selects a final transferable critical horizon set through cross-horizon validation based on optimization feasibility and cost. Meanwhile, the proposed STAT-adaptive spatial targeting (STAT-AST) method constructs device-specific spatial features for BESS and SC siting to retain compact yet high-impact candidate bus sets. Case studies on 33-bus and 240-bus distribution systems demonstrate that the proposed STAT framework can substantially reduce the temporal and spatial planning dimensions while preserving planning fidelity. Full-year validation further confirms that the resulting investment plans can eliminate EV-induced voltage and thermal violations while maintaining feasible BESS operations.
Model-Based and Data-Driven Hierarchical Control and Topology Co-Design for Robust Networked Systems
In this paper, we consider a class of networked systems comprising an interconnected set of linear subsystems, disturbance inputs, and performance outputs. Using dissipativity theory, we first propose a model-based hierarchical control design strategy to ensure the closed-loop networked system is dissipative from its disturbance inputs to performance outputs. This involves designing local controllers for each subsystem to enforce local dissipativity guarantees, which are then exploited to co-design distributed global controllers and the interconnection topology to enforce global dissipativity guarantees while optimizing interconnection topology costs. The overall design process requires only solving a sequence of linear matrix inequality (LMI) problems, thereby retaining compositionality and decentralizability while avoiding non-convex, iterative design processes that are inefficient and centralized. This model-based hierarchical control design strategy assumes the knowledge of the subsystem dynamics, which may not hold in many real-world networked systems. Motivated by this, we also propose a data-driven hierarchical control design strategy that assumes only the availability of rich input-state-output trajectory data from the subsystems. The proposed data-driven design process assumes that the unknown disturbances affecting the subsystem dynamics are bounded by a quadratic matrix inequality (relaxing conventional bounds) and accounts for this by using the matrix S-lemma. Finally, the effectiveness of the proposed model-based and data-driven hierarchical control designs is illustrated for a networked system representing a DC microgrid, with the aim of enforcing robust (dissipative) voltage regulation and current sharing.
comment: To be submitted to Automatica
A High-Precision Clock Synchronization System for the CEPC Accelerator
The Circular Electron Positron Collider (CEPC) distributes a reference clock distributed to 192 control nodes along its 100~km underground tunnel. The required synchronization precision is 30~ps (standard deviation). We present an enhanced White Rabbit (WR)-based clock synchronization system designed to meet this requirement. A noise-budget analysis of the standard WR slave loop identifies the analog actuation chain (DAC + VCXO + multiplier PLL) and restart-induced timing uncertainty as the dominant limitations. In our redesigned node, the DAC+VCXO chain is replaced by a Si5345A DSPLL clock generator with DCO-based phase control, removing the board-level analog tuning stage. GTX transceiver phase alignment and manual byte-alignment fixing reduce restart uncertainty from 88.8~ps to 12~ps peak-to-peak. For multi-node operation, we introduce a cascaded global-control architecture with PC-side PID auto-tuned by TD3 reinforcement learning, on-chip-temperature feed-forward calibrated to $-0.76\,\mathrm{ps}/^\circ\mathrm{C}$. The measured point-to-point synchronization precision is 3.38~ps over 1~m fiber and 3.92~ps over 50~km. In a 12-level cascade, the end-node precision reaches 6.66~ps at constant temperature and 7.30~ps under a 13$\,^\circ$C temperature swing. Synchronized-clock TIE jitter stays below 1~ps regardless of cascade depth. Restart uncertainty is 2.82~ps (std.\ dev.). A 4-level cascade operated stably for 25 hours of continuous monitoring. All measured metrics fall well within the CEPC 30~ps budget.
comment: 23 pages,17 figures
Large Language Models in Process Systems Engineering: Opportunities, Architectures, and Industrial Deployment Challenges
Large Language Models (LLMs) have rapidly emerged as tools of interest across engineering disciplines, and Process Systems Engineering (PSE) is no exception. This survey provides a systematic review of LLM applications in PSE, organizing the literature into seven categories: (1) process design and engineering, (2) molecular design and synthesis, (3) process modeling and simulation, (4) time-series forecasting, (5) optimization and scheduling, (6) process control, and (7) fault detection and diagnosis. For each category, we summarize the state of the art, identify common methodological approaches, and critically assess demonstrated capabilities versus aspirational claims. We find that LLMs show genuine promise for tasks involving natural language, including querying documentation, synthesizing unstructured knowledge, and enabling flexible human-machine interaction. However, applications requiring real-time execution, constraint satisfaction, or formal safety guarantees remain challenging. We conclude by identifying open problems and productive research directions for the PSE community.
Beyond Resilience -- A Conceptual Framework for Civic Ascent
The resilience literature measures urban performance as recovery: the degree to which a city returns to its pre-shock baseline. This paper develops a stronger concept -- civic ascent -- as part of a broader research program on the ethology of coupled agent-environment systems, of which the city is the deepest available empirical instance. Civic ascent is defined as the condition in which a city emerges from shock with higher functional capacity than before. We develop a conceptual framework in the ethological tradition, treating the city as a coupled system of three slow state variables -- topos (physical structure), nomos (institutional structure), and hexis (civic judgment) -- together with a fast affective channel (delta) through which shocks to topos and nomos reach hexis. The framework distinguishes three structurally distinct pressures on civic systems: shocks (discontinuities in T or M), decay (continuous entropy), and leakage (active extraction of civic surplus into non-civic pools). The ascent condition is that reinforcement from cross-coupling of T, M, and H exceeds the combined loss from decay and leakage. Post-shock ascent is measured by a normalised improvement index A(T) applied to a composite civic performance signal P(t) constructed from scale-adjusted key performance indicators, distinguishing intrinsic civic ascent from demographically driven growth. New York City after September 11, 2001, is proposed as the primary empirical case; the operational measurement program is specified in the companion NYC Civic Data Map (Washburn 2026c, 133 KPIs) and executed in Paper 2. The reader for whom only the urban contribution is of interest will find it complete in itself; the reader interested in the larger program will find this paper its formal core.
Storage and Transport Capacity Design for a Self-Reliable Two-Node Stochastic Resource System
We study a two-node stochastic resource system operating over a finite horizon. Each node experiences uncertain supply and demand and is equipped with finite storage. The objective is to ensure that resource levels remain within prescribed limits with high probability. To this end, we formulate a chance-constrained capacity-design problem in which resources can be exchanged through a capacity-limited transport link. We characterize the minimum storage required at each node, derive the optimal transport policy, and quantify the trade-off between storage and transport capacities. Our results show the existence of a critical transport-capacity threshold that enables full risk pooling between the nodes. Moreover, this threshold decreases with the operating horizon, implying that full-pooling performance can be achieved with progressively smaller transport capacity over longer horizons.
comment: 9 pages, 4 figures
Polymer-based Capacitive Micromachined Transducer-Enabled Inline Monitoring of Ultrasonic Welding in Thermoplastic Carbon Fiber Composites
Thermoplastic composite structures enable lightweight, recyclable, and high-throughput aerospace manufacturing, but reliable quality assurance of advanced joining processes remains a key challenge. This work presents a compact, low-cost, and wireless ultrasonic non-destructive testing system for real-time, inline monitoring of continuous ultrasonic welding of thermoplastic carbon fiber composites. The system integrates custom-fabricated polymer-based capacitive micromachined ultrasonic transducers (polyCMUTs) with the ultra-low-power WULPUS platform, enabling operation in the harsh, high-interference welding environment. An eight-element linear polyCMUT array operating at a center frequency of approximately 3.6 MHz is designed, fabricated, packaged, and integrated into an industrial welding setup. Inline measurements are performed during welding of carbon fiber laminates with intentionally introduced defects. Process-synchronous ultrasonic data reveal consistent depth-of-echo shifts at defect locations, in strong agreement with X-ray computed tomography ground truth. Across 21 welds, all induced defects are detected without false negatives and with limited false positives. The results demonstrate that polymer-based CMUT technology enables robust, scalable, and manufacturing-compatible ultrasonic sensing, representing a decisive step toward intelligent process monitoring and quality assurance for next-generation thermoplastic composite welding.
comment: 15 pages, 12 Figures
Two-Layer Linear Auto-Regressive Models Estimate Latent States ICML 2026
Auto-regressive models have emerged as powerful tools for sequential data, from language to video. Understanding how and why these models learn latent representations remains an open theoretical question. In this work, we demonstrate that when trained by empirical risk minimization on data from partially observed linear dynamical systems, two-layer linear auto-regressive models naturally learn to approximate Kalman filtering. In particular, we show that the learned hidden representation coincides, up to a similarity transformation, with the state estimates produced by the optimal (Kalman) filter, even though the model has no explicit knowledge of the underlying dynamics or state. The result follows from three main insights. First, we establish that the Kalman filter is well approximated by an auto-regressive model with bounded truncation error. Second, we show that despite non-convexity, the two-layer optimization landscape is benign, i.e., all stationary points are either strict saddles or global minima. Finally, as our main contributions, we provide finite-sample guarantees on prediction error, parameter estimation error, and latent state recovery. Numerical simulations support the theoretical results and demonstrate that the latent representations of auto-regressive models recover state estimates.
comment: ICML 2026
Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites
Rapidly expanding low Earth orbit satellite constellations are placing increasing demands on terrestrial ground networks, motivating the development of more efficient ground station network designs. Current approaches select sites from predefined locations, limiting optimization to existing infrastructure and constraining performance. In contrast, free-placement optimization operates over a continuous spatial domain on Earth, broadening the search space and allowing higher-throughput configurations at the cost of potentially requiring new infrastructure deployment. In this work, we introduce SCORE (Sequential Cyclic Optimization via Refinement & Evaluation), a two-stage free-placement method for ground station design. SCORE combines sequential coordinate selection with cyclic refinement to manage high-dimensionality, non-convexity, and local minima that challenge global optimizers. We benchmark SCORE against one-shot methods such as differential evolution (DE) and integer programming approaches using locations from Kongsberg Satellite Services and the World Teleport Association. Tests across two commercial Earth observation constellations (Capella Space and ICEYE) and one synthetic Walker-Star constellation show that SCORE requires up to 5x fewer function evaluations to converge relative to DE while improving downlink throughput by up to 13%. Compared to fixed-site methods, unconstrained SCORE achieves up to 15% greater total downlink, establishing a strong empirical performance benchmark for flexible placement; infrastructure-constrained SCORE retains over 92% of this gain while restricting placement to within proximity of existing fiber and power infrastructure. We also explore trade-offs between expanding existing stations and deploying new sites, informing future ground network design for operational constellations.
comment: 34 pages, 13 figures, 11 tables, Journal of Aerospace Information Systems (JAIS)
Modeling and Estimation of Solid Electrolyte Interphase during Formation in Battery Manufacturing
The solid electrolyte interphase (SEI) - a critical passivation layer that governs the longevity, safety, and efficiency of lithium-ion batteries - is created during the last step in cell manufacturing called cell formation. Conventional cell formation protocols are largely empirical, resulting in long processing times and limited control over the SEI growth rate that influences SEI quality and lifetime performance. This paper develops a control-oriented, semi-empirical model to estimate SEI thickness growth from terminal voltage and cell expansion measurements acquired in-operando during manufacturing using low-cost micrometer-precision integrated-sensing fixture. Model parameters are calibrated against cell formation data, and an unscented Kalman filter is employed to estimate the SEI film growth. The results lay the foundation for future closed-loop control of SEI growth, enabling high-quality and more efficient formation processes.
comment: 8 pages, 6 figures. Accepted by the 2026 American Control Conference (ACC)
Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning
Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.
comment: Accepted to the 23rd IFAC World Congress, 2026
Certifiable Safe RLHF: Semantic Grounding and Fixed Penalty Constraint Optimization for Safer LLM Alignment
Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the utility of model outputs and mitigating their potential for harm is a complex and persistent challenge. Contemporary approaches frequently formalize this problem within the framework of Constrained Markov Decision Processes (CMDPs) and employ established CMDP optimization techniques. However, these methods exhibit two notable limitations. First, their reliance on reward and cost functions renders performance highly sensitive to the underlying scoring mechanism, which must capture semantic meaning rather than being triggered by superficial keywords. Second, CMDP-based training entails tuning dual-variable, a process that is both computationally expensive and does not provide any provable safety guarantee for a fixed dual variable that can be exploitable through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF) that introduces a cost model trained on a large-scale corpus to assign semantically grounded safety scores. In contrast to the lagrangian-based approach, CS-RLHF adopts a rectified penalty-based formulation. This design draws on the theory of exact penalty functions in constrained optimization, wherein constraint satisfaction is enforced directly through a suitably chosen penalty term. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed at the optimizer, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art LLM model responses rendering at-least 5 times efficient against nominal and jail-breaking prompts
On the Nonexistence of Continuous Immersions for Discrete-time Systems
Understanding when linear immersions of nonlinear dynamical systems exist is important since such immersions allow us to leverage the rich tools of linear system theory to analyze nonlinear dynamics. Recently, Liu et al. (2023) showed that continuous-time dynamical systems that admit countably many but more than one omega-limit sets cannot be immersed into finite dimensional linear systems with a one-to-one and continuous mapping. In this paper, we extend these results to discrete-time dynamics and show that similar obstructions exist also in discrete time. We further consider a generalization involving alpha-limit sets. Several examples are provided to demonstrate the results.
comment: Copyright 2026 the authors. This work has been accepted to IFAC 2026 for publication under a Creative Commons License CC-BY-NC-ND
Non-Equilibrium MAV-Capture-MAV via Time-Optimal Planning and Reinforcement Learning
The capture of flying MAVs (micro aerial vehicles) has garnered increasing research attention due to its intriguing challenges and promising applications. Despite recent advancements, a key limitation of existing work is that capture strategies are often relatively simple and constrained by platform performance. This paper addresses control strategies capable of capturing high-maneuverability targets. The unique challenge of achieving target capture under unstable conditions distinguishes this task from traditional pursuit-evasion and guidance problems. In this study, we transition from larger MAV platforms to a specially designed, compact capture MAV equipped with a custom launching device while maintaining high maneuverability. We explore both time-optimal planning (TOP) and reinforcement learning (RL) methods. Simulations demonstrate that TOP offers highly maneuverable and shorter trajectories, while RL excels in real-time adaptability and stability. Moreover, the RL method has been tested in real-world scenarios, successfully achieving target capture even in unstable states.
Optimal Microgrid Sizing of Offshore Renewable Energy Sources for Offshore Platforms and Coastal Communities
The global energy landscape is undergoing a transformative shift towards renewable energy and advanced storage solutions, driven by the urgent need for sustainable and resilient power systems. Isolated offshore communities, such as islands and offshore platforms, which traditionally rely on mainland grids or diesel generators, stand to gain significantly from renewable energy integration. Promising offshore renewable technologies include wind turbines, wave and tidal energy converters, and floating photovoltaic systems, paired with a storage solution like battery energy storage systems. This paper introduces a renewable energy microgrid optimizer (REMO), a tool designed to identify the optimal sizes of renewable generation and storage resources for offshore microgrids. A key challenge in such models is accurately accounting for battery degradation costs. To address this, the REMO model integrates a deep neural network-based battery degradation (DNN-BD) module, which factors in variables like ambient temperature, charge/discharge rates, state of charge, depth of discharge and battery health. Simulations on six test regions demonstrate that the REMO-DNN-BD approach minimizes lifetime energy costs while maintaining high reliability and sustainability, making it a viable design solution for offshore microgrid systems.
WAKE-NET: A 3D-Wake-Aware Economic Turbine Layout and Cabling Optimization Framework for Multi-Capacity Multi-Hub-Height Wind Farms Serving Grid-Scale and Industrial Power Systems
The global transition towards renewable energy has accelerated the deployment of utility-scale wind farms, increasing the need for accurate performance and economic assessments. Although wind energy offers substantial potential for carbon emission reduction, investment decisions are highly sensitive to predicted annual energy production and economic profitability. Conventionally wind farm analyses often estimate turbine power output based solely on incoming wind conditions, neglecting wake interactions between turbines. These wake effects can significantly reduce downstream turbine performance, leading to overestimation of energy yield and financial returns. This study proposes WAKE-NET, a 3D wake-aware optimization framework that integrates turbine layout optimization, turbine capacity selection, cable routing, and hub height diversification within a unified profit-driven formulation. Unlike traditional approaches that assume a uniform hub height and turbine capacities or ignore wake dynamics, the proposed framework accounts for wake-induced power losses during optimization. A benchmark wake-ignorant model is also evaluated to quantify the impact of neglecting wake interactions. Results indicate that the wake-ignorant optimization can significantly overestimate annual profits, while the use of multiple hub heights and capacities reduce wake overlap and improve spatial utilization. Overall, the findings demonstrate that wake-aware optimization coupled with hub height and capacity diversification provides more reliable energy yield prediction and economic assessment, offering valuable guidance for large-scale wind farm planning and investment.
Solvability of the Output Corridor Control Problem by Pulse-Modulated Feedback
The problem of maintaining the output of a positive time-invariant single-input single-output system within a predefined corridor of values is treated. For third-order plants possessing a certain structure, it is proven that the problem is always solvable under stationary conditions by means of pulse-modulated feedback. The obtained result is utilized to assess the feasibility of patient-specific pharmacokinetic-pharmacodynamic models with respect to patient safety. A population of Wiener models capturing the dynamics of a neuromuscular blockade agent is studied to investigate whether or not they can be driven into the desired output corridor by clinically acceptable sequential drug doses (boluses). It is demonstrated that low values of a parameter in the nonlinear pharmacodynamic part lie behind the detected model infeasibility.
comment: shortened version will be presented at IFAC World Congress 2026, Busan, Korea
Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies ICML 2026
Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty that distinguishes online RL from standard generative modeling is the lack of direct samples from the target Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which uses a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. However, it remains unclear how these objectives are formally related, or whether they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that share the same expectation. We show that existing noise-expectation and gradient-expectation methods are simply two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and it enables the principled combination of Q-value and Q-gradient information to form an effective estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.
comment: ICML 2026 (Spotlight); Code: https://github.com/azizanlab/ReverseFlowMatching
Reachability for Low-Thrust Trajectories via Maximum Initial Mass
Reachability analysis plays a central role in low-thrust spacecraft trajectory optimization by identifying which target states can be achieved under constraints on time, thrust, and propellant. Classical approaches construct reachable sets by solving many optimal control problems over grids of terminal states, requiring extensive forward simulations with fixed initial conditions. While effective, this approach is computationally expensive and becomes impractical for high-dimensional systems or strongly nonlinear dynamics, such as those encountered in cislunar environments or solar sail missions. This work introduces a dual formulation of the reachability problem. Instead of computing reachable sets directly, we determine, for fixed transfer time and boundary conditions, the maximum allowable initial mass (or, for solar sails, a scalar sail-strength parameter) that permits a successful transfer. A target is reachable if the spacecraft's initial mass does not exceed this threshold. This reformulation reduces reachability assessment to a scalar optimization problem for each target, producing a smooth scalar field that encodes equivalent feasibility information to classical reachable sets. We develop indirect maximum-initial-mass (MIM) formulations for both electric low-thrust and solar-sail dynamics and show how they can serve as efficient reachability oracles. Building on this formulation, we construct data-driven surrogate models to approximate the MIM-based reachability indicator. We investigate fully connected neural networks and demonstrate that residual networks provide the best trade-off between accuracy, training stability, and model complexity. The resulting surrogates enable rapid reachability evaluation while preserving the numerical advantages of the dual formulation, offering a practical tool for preliminary mission design and feasibility assessment.
comment: Presented at the 30th International Symposium on Space Flight Dynamics, 1-5 June 2026, Toulouse, France
Shaping Energy Exchange with Gyroscopic Interconnections: a Geometric Approach
Gyroscopic interconnections enable redistribution of energy among degrees of freedom while preserving passivity and total energy, and they play a central role in controlled Lagrangian methods and IDA-PBC. Yet their quantitative effect on transient energy exchange and subsystem performance is not well characterised. We study a conservative mechanical system with constant skew-symmetric velocity coupling. Its dynamics are integrable and evolve on invariant two-tori, whose projections onto subsystem phase planes provide geometric description of energy exchange. When the ratio of normal-mode frequencies is rational, these projections become closed resonant Lissajous curves, enabling structured analysis of subsystem trajectories. To quantify subsystem behaviour, we introduce the inscribed-radius metric: the radius of the largest origin-centred circle contained in a projected trajectory. This gives a lower bound on attainable subsystem energy and acts as an internal performance measure. We derive resonance conditions and develop an efficient method to compute or certify the inscribed radius without time-domain simulation. Our results show that low-order resonances can strongly restrict energy depletion through phase-locking, whereas high-order resonances recover conservative bounds. These insights lead to an explicit interconnection-shaping design framework for both energy absorption and containment control strategies, while taking responsiveness into account.
comment: Conference paper submitted to the 10th IEEE Conference on Control Technology and Applications (CCTA) 2026 In Vancouver, and is currently under review
Subsystem Structure as an Inferential Resource for Coupled Engineered Systems
Engineered infrastructure systems pose inverse problems in which hidden states, unknown parameters, and subsystem couplings must be inferred from sparse and noisy measurements. These problems are difficult because physical subsystems are heterogeneous, sensing is partial, uncertainty is distributed across subsystem interfaces, and computational cost grows rapidly with system size. We address this challenge with probabilistic compositional inference, a graph-based architecture that represents a coupled system as interacting subsystems, each retaining its own local model, estimator, and uncertainty representation, while coupling is handled through physically meaningful stochastic messages exchanged across subsystem interfaces. This formulation allows mechanistic, learned, and deterministic components to coexist within a single inference framework and propagates calibrated uncertainty without assembling a global augmented state or covariance. We validate the framework in three increasingly demanding settings: a sparse-sensing canonical inverse problem, where interface couplings can also be learned from data; infrastructure-scale power networks, where the method matches centralized joint state-and-parameter inference while reducing computational scaling from approximately cubic to approximately linear; and a multi-physics turbine embedded in a power-grid network, where heterogeneous subsystems compose hierarchically without degrading local inference or collapsing local posteriors into a global estimate. Together, these results show that subsystem structure can be exploited as the organizing principle for uncertainty-aware inverse inference in coupled engineered systems.
QoS Improvement in Multi User Cellular-Symbiotic Radio Network Assisted by Active-STAR-RIS
In this article, we employ active simultaneously transmitting and reflecting reconfigurable intelligent surfaces (ASRIS) to enhance the quality of 6G cellular network services. The network integrates commensal symbiotic radio (CSR) subsystems to facilitate communication between passive Internet of Things (IoT) users and active users, referred to as symbiotic backscatter devices (SBDs) and symbiotic user equipments (SUEs), respectively. Since the SBDs are passive, transmitting information to the SUEs poses significant challenges. To overcome this challenge, we harness the capabilities of massive multiple input multiple output (MIMO) antennas within the base station (BS) to relay the information transmitted by SBDs with greater power. This scheme uses the non-orthogonal multiple access (NOMA) technique for multiple access among all users, and potential interferences are eliminated using successive interference cancellation (SIC). The primary objective is to maximize the throughput between SBDs and SUEs. To achieve this, we formulate an optimization problem involving variables such as active beamforming coefficients at the BS and ASRIS, phase adjustments of ASRIS, and scheduling parameters between CSR and cellular networks. To solve this optimization problem, we used three deep reinforcement learning (DRL) methods: proximal policy optimization (PPO), twin delayed deep deterministic policy gradient (TD3), and asynchronous advantage actor critic (A3C). These methods were simulated, and the results demonstrate that A3C, TD3, and PPO have the best convergence speeds and achieve the highest increases in network throughput, respectively. Finally, the proposed scheme was evaluated using passive simultaneously transmitting and reflecting RIS (STAR-RIS), which demonstrated poorer performance compared to ASRIS.
comment: This article will be submitted to the Transactions journal
Enhancing Energy and Spectral Efficiency in IoT-Cellular Networks via Active SIM-Equipped LEO Satellites
This paper investigates a low Earth orbit (LEO) satellite communication system enhanced by an active stacked intelligent metasurface (ASIM), mounted on the backplate of the satellite solar panels to efficiently utilize limited onboard space and reduce the main satellite power amplifier requirements. The system serves multiple ground users via rate-splitting multiple access (RSMA) and IoT devices through a symbiotic radio network. Multi-layer sequential processing in the ASIM improves effective channel gains and suppresses inter-user interference, outperforming active RIS and beyond-diagonal RIS designs. Three optimization approaches are evaluated: block coordinate descent with successive convex approximation (BCD-SCA), model-assisted multi-agent constraint soft actor-critic (MA-CSAC), and multi-constraint proximal policy optimization (MCPPO). Simulation results show that BCD-SCA converges fast and stably in convex scenarios without learning, MCPPO achieves rapid initial convergence with moderate stability, and MA-CSAC attains the highest long-term spectral and energy efficiency in large-scale networks. Energy-spectral efficiency trade-offs are analyzed for different ASIM elements, satellite antennas, and transmit power. Overall, the study demonstrates that integrating multi-layer ASIM with suitable optimization algorithms offers a scalable, energy-efficient, and high-performance solution for next-generation LEO satellite communications.
Robotics
TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation
Contact-rich manipulation requires robots to continuously perceive and regulate evolving physical interactions under dynamic contact transitions or complex surface geometries. Recent imitation learning methods improve contact-aware control by incorporating tactile or force feedback, but they rarely model the asymmetric spatiotemporal roles of global force and local tactile sensing. To address this, we propose TacForeSight, a lightweight force-conditioned tactile foresight framework for real-time manipulation. The core component is TacForceWM, a tactile world model that predicts short-horizon tactile latent dynamics from dual-finger tactile observations conditioned on high-frequency wrist force and torque signals. Another key component, the Predictive Tactile-Conditioned Policy, leverages the predicted latents as anticipatory contact priors, models the current-to-future tactile evolution via cross-attention, and adaptively fuses visuo-tactile features through a tactile-guided gating module. By forecasting purely within a compact latent space, TacForeSight enables proactive contact reasoning with efficient real-time inference suitable for high-frequency manipulation control. Real-robot experiments on five representative tasks and three in-process perturbation settings show that TacForeSight consistently outperforms existing baselines, particularly under dynamic contact disturbances. All models and datasets will be made publicly available on the project website at https://tacforesight.github.io/ProjectPage.
JOIN: Anchor-Grasp-Conditioned Joining via Opposition, Inference, and Navigation for Bimanual Assistive Manipulation
Assistive mobility and manipulation platforms have received increasing attention as a means of restoring independence to individuals with disabilities. While effective for many basic activities of daily living (ADLs), a significant percentage of everyday tasks such as opening a jar, pouring a liquid, lifting a tray, or basic meal preparation, is fundamentally bimanual and remains out of reach for any single-arm system. Adding a second arm to a wheelchair is impractical, due to the additional power draw, cost, and the loss of space required for transfers and mobility. We instead propose a heterogeneous, on-demand bimanual system, in which a wheelchair-mounted anchor arm is joined when needed by a summoned mobile manipulator that serves as a complement arm. The central technical problem, which we call bimanual joining, is conditional: the anchor has already committed to a grasp, and the complement arm must choose where to stand and what to grasp to complete the task. We formulate bimanual joining as a three-phase decomposition (plan, drive, grasp) and show that a vision-language model (VLM), coupled with standard geometric tools, provides task-level knowledge sufficient to solve a representative class of bimanual ADLs. Our system JOIN, contributes (i) a wheelchair-referenced opposition score, and (ii) task-conditioned directional manipulability. We evaluate JOIN on a Kinova Gen3 anchor and a Hello Robot Stretch~3 complement on representative same-object and different-object tasks. JOIN accomplished more attempts (19/20) than state-of-the-art methods (14/20) and required markedly less correction by the operator.
comment: Xiang Zhi Tan and Taşkın Padır share equal advising
EM-Fall: Embodied mmWave Sensing for Day-and-Night Fall Detection on Humanoid Robots
Falls are one of the leading causes of injury and hospitalization among elderly individuals, making reliable fall awareness an essential capability for safety monitoring in residential environments. However, existing fall detection systems often rely on wearable devices or fixed sensing installations, which may suffer from low user compliance, limited spatial coverage, or degraded performance under occlusion and poor lighting conditions. In this work, we propose \textbf{EM-Fall}, an embodied fall detection framework deployed on a mobile humanoid robot. The system integrates millimeter-wave (mmWave) sensing with robotic mobility, allowing the robot to actively adjust its sensing viewpoint and maintain target observability across rooms and under occlusion. To address interference in complex residential environments, including pet motion and multipath artifacts, we design a human-centered perception pipeline combined with lightweight temporal modeling to capture motion evolution before, during, and after fall events. We evaluate the proposed system across eight real indoor environments with four participants and construct an in-home mmWave fall detection dataset. Experimental results show that the embodied mobile sensing paradigm improves monitoring continuity and maintains robust fall detection performance under diverse environmental conditions. The proposed framework provides a practical solution for robot-assisted safety monitoring in home environments.
RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning
Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold and progressively shifts optimization towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo demonstrates free-kick shot error 48.6% lower and shoot velocity 2.96x than prior work baselines. In real world on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in free-kick and moving-ball cases, accordingly. And the post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: $\href{https://opendrivelab.com/RoboNaldo}{\text{opendrivelab.com/RoboNaldo}}$.
A Distributed Multi-UGV Exploration Framework With Loop-Aware Planning and Descriptor-Aided Localization in Resource-Limited Environments
Robust and efficient cooperative exploration with multiple unmanned ground vehicles (UGVs) in unknown, GPSdenied, and bandwidth-limited environments without prior maps remains challenging, as localization drift degrades map consistency and induces redundant coverage. This paper presents a fully distributed exploration framework that couples descriptoraided inter-UGV loop closure with loop-aware hierarchical planning while enabling autonomous localization and exploration. We develop a lightweight LiDAR global descriptor with range-image prealignment to enable robust cross-UGV place recognition under large yaw and lateral variations, and use verified loop closures to maintain globally consistent trajectories and a sparse topological representation. We further introduce an uncertainty-aware crossUGV loop-closure selection module that scores candidate loop closures under pose uncertainty and retains high-utility loop closures as planning anchors for global task allocation and local route refinement. Simulations and real-UGV experiments show that the loop-closure module achieves AR@1/AR@1% of 89.9%/95.5%, distributed optimization reduces absolute trajectory error, the system substantially reduces two-way communication volume, and the overall framework reduces exploration time and travel distance by 15% and 14%, respectively, compared with an mTSP baseline.
Generation of Diverse and Functional Robot Designs using Superquadrics Parametrisation and Quality-Diversity PPSN 2026
Generative design of robots requires navigating a vast search-space, encompassing physical configurations and behavioural parameters. Evolutionary Algorithms (EAs) have shown promising results, but often converge prematurely to a small set of sub-optimal designs. Most EAs fail to maintain sufficient diversity in the population that would allow the discovery of distinct functional robots. To counter premature convergence, we introduce a superquadrics-based representation (SQs) for robot bodies. SQs are interpretable, compact and computationally efficient mathematical representations of 3D geometrical shapes that can be tuned to specific design-spaces. To encourage morphological diversity, we combine this representation with a quality-diversity (QD) algorithm (MAP-Elites). We compare SQs and Compositional Pattern Producing Networks representations as generators of morphologies, combining them with standard EAs and MAP-Elites. In two test environments, we find that using SQs to generate morphology in conjunction with the MAP-Elites algorithm reaches the highest QD-score across both environments, maximising diversity of design and functionality of generated robots. The findings highlight the benefits of using a compact and interpretable geometric representation for exploring a complex design-space and suggest that combining SQs with an explicit diversity mechanism increases the quality and number of designs generated.
comment: Accepted at PPSN 2026
A Spiking Neural Architecture for Coordinating Arm and Locomotor Control
Spiking Neural Networks (SNNs) coupled with neuromorphic hardware offer energy-efficient solutions for humanoid robot control. However, existing SNN-based motor control systems address bipedal locomotion and arm control in isolation, leaving integrated control of both unaddressed. We present a spiking architecture that coordinates force-based arm control and bipedal locomotion in a simulated humanoid, using the Neural Engineering Framework (NEF) and Semantic Pointer Architecture (SPA). High-level action selection between locomotor and arm control is mediated by a biologically grounded spiking basal ganglia model. We validate the system through co-simulation of Nengo, for the neural control, and Isaac Sim, demonstrating successful target reaching, continuous digit drawing, path-following locomotion, and finally, switching between walking and arm control via basal ganglia disinhibition. To our knowledge, this is the first integrated spiking controller to combine bipedal locomotion and arm control on a full-scale humanoid platform. The full spike-based implementation enables future deployment on low-power neuromorphic hardware.
Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving CVPR2026
Learning-based motion planners, despite recent progress, often suffer from temporal inconsistency. Small perturbations across frames can accumulate into unstable trajectories, degrading comfort and safety in closed-loop driving. Several methods attempt to inject history as a static conditioning signal to stabilize outputs, only to induce the planner to copy historical patterns instead of adapting to environment contexts. To address this limitation, we propose Diffusion Forcing Planner (DFP), a diffusion-based planning framework driven by history-guided control. Specifically, DFP decomposes the full trajectory into history, current and future segments, and assign independent noise levels to each segment. The model jointly denoises the historical and the future segments, enforcing a heterogeneous joint diffusion process. At inference, classifier-free guidance (CFG) is applied to steer future sampling using annealed history in a controllable manner. Closed-loop evaluation and comprehensive ablations on nuPlan show that DFP achieves competitive performance while producing continuous, stable, and controllable motion plans in complex driving scenarios.
comment: CVPR2026
Multi-UAV Active Sensing with Information Gain-based Planning and Belief Fusion
Unmanned aerial vehicles (UAVs) are increasingly used for active sensing and information gathering in spatially distributed environments. Their performance, however, is constrained by limited flight time, sensing uncertainty, and the trade-off between spatial coverage and observation accuracy. This paper presents a real-world validation of a multi-UAV active sensing framework for probabilistic binary terrain mapping, with precision agriculture used as the application case. The environment is represented as a probabilistic belief map, where spatial dependencies are modeled through a factor-graph formulation. UAV decision making is guided by Information Gain based Informative Path Planning (IGbIPP), and the approach is compared with Random Walk and Sweep coverage path planning baselines using both synthetic terrains and real UAV-derived agricultural imagery. The study also evaluates spatial correlation weights and several probabilistic belief-fusion rules for multi-UAV information sharing. Results show that IGbIPP reduces entropy and mapping error more effectively than the baselines, while a wider field of view improves real-world coverage and map accuracy. The results further show that simple equal or biased spatial weights can be more robust than adaptive weights, and that Bayesian, log-odds, and Dempster--Shafer fusion achieve the best cooperative mapping performance. These findings highlight the importance of uncertainty-driven planning, sensing geometry, spatial modeling, and probabilistic fusion for real-world UAV-based active sensing.
Language-Driven Cost Optimization for Autonomous Driving SC
The driving behavior of autonomous vehicles is typically governed by the cost function of their motion planner, which encodes objectives such as speed tracking, smoothness, lane keeping, and collision avoidance. However, tuning the parameters that shape this cost function is a challenging task that requires technical expertise, limiting the vehicle's ability to adapt to evolving traffic scenarios or end-user preferences. This work presents a language-driven framework for adaptive cost design in autonomous driving. A Large Language Model (LLM) interprets structured scenario descriptions and natural language user queries to generate the parameters applied to a risk-aware Model Predictive Path Integral (MPPI) controller. The system incorporates a human-in-the-loop validation stage in which the proposed behavioral changes are described in non-technical language and confirmed prior to deployment. Users may additionally provide feedback either before or after deployment, enabling iterative refinement of the vehicle's motion behavior. The framework is evaluated across multiple queries in realistic driving scenarios to assess its effectiveness. Simulation results demonstrate that the method successfully induces behavioral changes that align with the intended requirements in an intuitive manner, thereby bridging the gap between intelligent vehicle control systems and end users.
comment: Paper accepted at IEEE Intelligent Transportation Systems Conference (ITSC) 2026
Resilient Navigation for Autonomous Farm Robots by Leveraging Jerk-Augmented Models with IMU-Only Disturbance Rejection
Precise state estimation for navigation of autonomous agricultural robots is often compromised by sensor outages (GNSS/LiDAR/Visual) and high-frequency vibrations inherent in off-road environments. This paper proposes a robust navigation algorithm based on a jerk-augmented Extended Kalman Filter (EKF) integrated with a Multiple Tuning Factor (MTF) adaptation method. Unlike standard EKF approaches that assume constant measurement noise, our method dynamically adjusts the measurement covariance matrix in real-time, allowing the system to cope with sudden disturbances and sensor outliers. We evaluate the algorithm using real-world data from a Salin247 autonomous robot. Results demonstrate that jerk-augmentation combined with MTF adaptation significantly reduces 3D position Root Mean Square Error (RMSE) compared to baseline EKF models, providing superior dead-reckoning capabilities.
AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning
Lifelong embodied navigation in dynamic environments requires robots to form persistent scene understanding from fragmentary observations, which remains difficult for existing methods that rely on explicit maps or scene graphs and struggle to generalize beyond structured settings. We propose AllDayNav, a lifelong self-learning navigation framework that implicitly encodes scene dynamics into the billion-scale parameters of a large model via reinforcement learning, powered by a self-evolving multimodal memory that maintains and updates visual keyframes, semantic descriptions, and temporal context while autonomously generating open-vocabulary instructions, image goals, and structured rewards. Experiments in both synthetic and real-world environments across cross-room, cross-episode, and cross-task scenarios show that AllDayNav achieves success rates approaching $100\%$ and consistently surpasses strong map-based, VLM, and RL baselines in path efficiency and robustness, demonstrating implicit, memory-driven reinforcement learning as a scalable alternative to explicit mapping for reliable lifelong navigation.
comment: Project Page: https://bagh2178.github.io/AllDayNav/
Task Robustness via Re-Labelling Vision-Action Robot Data
The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. However, these policies continue to struggle with following instructions, likely due to the limited linguistic and action sequence diversity in existing robotics datasets. This paper introduces Task Robustness via Re-Labelling Vision-Action Robot Data (TREAD), a scalable framework that leverages large Vision-Language Models (VLMs) to augment existing robotics datasets without additional data collection, harnessing the transferable knowledge embedded in these models. Our approach leverages a pretrained VLM through three stages: generating semantic sub-tasks from original instruction labels and initial scenes, segmenting demonstration videos conditioned on these sub-tasks, and producing diverse instructions that incorporate object properties, effectively decomposing longer demonstrations into grounded language-action pairs. We further enhance robustness by augmenting the data with linguistically diverse versions of the text goals. Evaluations on LIBERO demonstrate that policies trained on our augmented datasets exhibit improved performance on novel, unseen tasks and goals. Our results show that TREAD enhances both planning generalization through trajectory decomposition and language-conditioned policy generalization through increased linguistic diversity.
comment: Project website: https://akuramshin.github.io/tread
AgniNav: Configuration-Driven Cross-Embodiment Local Planning for Robot Navigation
Monocular local navigation is attractive for lightweight robots, but existing vision-based policies often couple perception to a specific body, camera height, and footprint, making transfer from wheeled bases to legged platforms dependent on retraining or active depth hardware. This paper introduces AgniNav, a configuration-driven local navigation framework that standardizes cross-embodiment transfer at the collision-envelope level. Each robot is specified by a measurable four-parameter safety envelope: collision-relevant height, front length, rear length, and half width. The height parameter conditions an image-to-scan network to predict a one-dimensional, collision-relevant pseudo-laserscan from a monocular color image, while the remaining footprint parameters configure a dimension-aware local planner for collision checking. Training uses height-conditioned column-minimum scan labels generated from paired color-depth data, allowing the same image to supervise different safety envelopes without collecting robot-specific data. To the best of our knowledge, AgniNav is the first monocular local-navigation framework that jointly conditions perception and planning on a shared collision-envelope configuration for zero-retraining deployment across wheeled, quadruped, and humanoid platforms. Real-robot experiments on a Turtlebot2, Unitree Go2, and Accelerated Evolution K1 achieve 39/40, 18/20, and 18/20 successes with 0/40, 1/20, and 2/20 collisions, respectively, while running at 30 Hz on Jetson Orin.
MV-Actor: Aligning Multi-View Semantics and Spatial Awareness for Bimanual Manipulation
Robotic manipulation has been widely applied in industrial scenarios. Compared with single-arm manipulation, bimanual manipulation is equipped with multiple cameras to capture information from different viewpoints. However, existing multi-view policies encode each view independently or fuse view features shallowly, resulting in limited sharing semantic perception and unreliable spatial awareness. In this paper, we propose \textbf{MV-Actor}, a multi-view perception framework that builds a unified semantic-spatial representation for bimanual manipulation. First, MV-Actor performs Multi-view Semantic Interaction to share semantic perception across views. Then it uses Semantic-Spatial Token Interaction to ground visual semantics with feed-forward reconstruction model features and acquire reliable spatial awareness. Finally, a Guided Metric Depth Repair module refines degraded sensor depth to provide more reliable metric anchors under consumer-grade depth noise. In simulation experiments conducted on the PerAct2 bimanual benchmark, MV-Actor achieves a state-of-the-art average success rate of 87.8\%. In real-world evaluations with more frequent viewpoint changes and unstable consumer-grade depth, MV-Actor outperforms both RGB and RGB-D baselines, further demonstrating the benefit of sharing semantic perception and reliable spatial awareness for bimanual manipulation.
comment: 14 pages,9 figures
Embodiment-conditioned Generalist Control for Multirotor Aerial Robots
We present a generalist position control policy capable of controlling arbitrary multirotor configurations of a certain rotor count (e.g., hexarotors or quadrotors) with a single set of network weights. The policy is conditioned on a physics-grounded embodiment descriptor: a mass and inertia-normalized control allocation matrix that captures how mass-normalized motor thrusts generate linear and angular accelerations in the body-frame. To train the policy, we sample from a broad distribution of arbitrary multirotor configurations, including non-planar and asymmetric systems, and optimize a single, compact network using Proximal Policy Optimization. Training requires only five minutes on an RTX 3090 GPU using a custom NVIDIA Warp-based dynamics simulator. Through extensive simulation experiments, we show that embodiment conditioning enables robust generalist control across arbitrary morphologies. We demonstrate zero-shot real-world transfer of this generalist policy on three diverse hexarotor systems, including a planar robot, a partially symmetric non-planar system, and a random asymmetric, non-planar configuration.
An Exposure-Time-Aligned Primary-Path Architecture for Autonomous-Driving ECUs
While end-to-end (E2E) autonomous driving has become the dominant research direction, production vehicles continue to rely on modular multi-NN pipelines for a non-trivial transitional period. The subject of this paper is the design of an architecture that, during this phase, supports a modular pipeline and an E2E path side by side and embeds a path for staged migration. Transplanted to a production SoC, egalitarian late fusion is compute-inefficient and offers no natural unit for staged E2E substitution. As an alternative, we propose three design principles: (i) Primary-Path, which explicitly selects a primary perception chain and prioritizes its enclosure within a single SoC pair over the non-critical paths (ii) Exposure-Time-Aligned, which propagates the primary sensor's exposure time $τ_{\rm exp}$ as a tag along the chain and event-drives the fusion node on matched $τ_{\rm exp}$ rather than a fixed cycle and (iii) Co-Path Coexistence, which, building on (i) and (ii), lets an E2E output path co-run with the modular pipeline within the same $τ_{\rm exp}$ cycle. On a Dual-SoC production AD-ECU, the implementation closes camera-shutter to planner-output latency at a mean of 296 ms within the 350 ms design budget. Under (iii), the modular pipeline is primary at production launch and the E2E path runs as shadow on real vehicles, and the E2E scope is expanded as evaluation evidence accumulates.
Gradient based Bilevel for Inverse Optimal Control, a Riemannian approach
Inverse Optimal Control (IOC) aims to recover the cost function that explains observed trajectories as solutions of an optimal control problem. Classical IOC formulations rely on bilevel optimization, which repeatedly solves a nested optimal control problem and quickly becomes computationally prohibitive for realistic systems. Recent projection-based approaches offer a promising alternative but suffer from numerical instability when solved with gradient-based methods due to violations of standard constraint qualifications. In this paper, we show that these difficulties stem from the geometric structure of the IOC feasible set. We demonstrate that the set of trajectories satisfying the optimality conditions naturally forms a manifold and reformulate IOC as an optimization problem on this manifold. Based on this insight, we propose a Riemannian Inverse Optimal Control (RIOC) method that projects observed trajectories onto the manifold of optimal solutions while preserving feasibility by construction. Experiments on real human arm trajectories show that the proposed method achieves comparable or better reconstruction accuracy than classical bilevel IOC while reducing computation time by about a factor of four. These results highlight the potential of geometric optimization methods to improve the scalability and reliability of IOC for robotics and human motion analysis.
comment: 6 Pages, 4 Figures. To be published in a control journal
GUIDE: Goal-Initialized Directional Understanding for End-to-End Visual Navigation
Learning-based visual navigation for legged robots typically relies on continuous goal updates from hierarchical state estimation to provide a persistent directional reference. This reliance incurs additional sensory and computational overhead and deviates from fully end-to-end mobile autonomy. Furthermore, under partial observability, policies are prone to learn myopic behaviors, easily becoming trapped in dead ends and complex structural layouts. To address these limitations, we investigate a goal-initialized navigation setting, where the target is provided only once at the beginning of an episode, requiring the robot to operate based on intrinsic spatial memory without subsequent goal updates from external modules. In this work, we propose GUIDE, a fully end-to-end reinforcement learning framework designed to cultivate internal directional awareness. Specifically, GUIDE incorporates a spatial anchor predictor that leverages multi-frequency proprioceptive history to extract egomotion representations, thereby maintaining a persistent long-horizon spatial context for navigation. Concurrently, it utilizes raw depth streams to perceive local environmental geometry. We evaluate the proposed framework across both simulation and real-world scenarios on a quadruped robot. Experiments show that GUIDE learns reliable egomotion and directional awareness, enabling a fully end-to-end deployed policy to safely navigate through dense clutter and structured mazes without subsequent goal guidance or prior maps.
comment: https://guide-navigation.github.io/
IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation
Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previous learning-based approaches typically employ imitation learning policies that output target end-effector poses tracked by low-level impedance controllers. In these systems, forceful interactions are either implicitly realized through steady-state tracking errors or explicitly commanded using wrist force/torque or tactile sensors. However, implicit approaches generalize poorly across object weights, while explicit approaches require specialized hardware and increase system complexity. In this work, we propose IMPACT, a framework that decouples these forceful tasks into task-planning and internal-model-based predictive control. Extensive simulation and real-world experiments demonstrate that the proposed framework achieves higher success rates and improved generalization to unseen object weights, as well as better safety and energy efficiency.
comment: Project website: https://gao-jiawei.com/IMPACT/
Bridging Semantics and Physical Execution: A Neuro-Symbolic Framework for Multi-Pair Robotic Assembly
Multi-pair robotic assembly in unstructured environments faces spatial interference and contact uncertainties. Existing paradigms fail to bridge cognitive decision-making and physical execution, as they either encounter state-space explosion and knowledge bottlenecks or suffer from logical hallucinations and topological conflicts. We propose an end-to-end neuro-symbolic framework that solves the challenge hierarchically: generating optimal subgraphs for each pair, decoupling generality from edge cases, and then resolving cross-pair interferences. Given an eye-on-hand RGB-D assembly scene, the framework extracts semantic instance identity and state while quantifying the scene for divergence calculation. For each pair, optimal subgraph is generated via LLM using barely basic actions to mitigate hallucinations. Supportive actions for edge cases are reasoned and inserted with a lightweight discriminator. Driven by the divergence between the quantified baseline and current scene, it is easily extensible at low cost. Augmented subgraphs are topologically coordinated into global sequences while preserving internal behavioral coherence. Dynamic behavior trees embedding atomic skills close the force-aware execution loop. Offline evaluation on 100 real-world scenes achieves 97.00% global executability, outperforming classical and state-of-the-art planners. Real-robot deployment on a UR3 arm attains 90% success rate with 0.5 mm tolerance under strong interference, demonstrating a unified and verifiable solution for complex autonomous assembly.
comment: Corresponding author: Aiguo Song (a.g.song@seu.edu.cn)
On-sky demonstration of reinforcement learning for adaptive optics control
Reinforcement learning (RL)-based algorithms have recently emerged as a promising approach for adaptive optics (AO) control. In simulations and laboratory experiments, they have demonstrated robustness to real-world effects such as photon and detector noise, misregistration, vibrations, and rapid variations in seeing conditions. However, their performance has not yet been validated on sky. We report the first on-sky demonstration of a reinforcement learning controller for adaptive optics, named Policy Optimization for AO (PO4AO). We further analyze its on-sky behavior and identify directions for improving the algorithm and its implementation.PO4AO was implemented and deployed on the Papyrus adaptive optics system installed at the Coudé focus of the 1.52 m telescope (T152) at the OHP. A Python-based implementation was interfaced with the existing real-time controller (DAO RTC) via shared-memory buffers. The performance of PO4AO was compared to that of a standard integrator controller over several nights, covering a range of flux levels and atmospheric conditions. PO4AO consistently outperformed the standard integrator in all tested configurations. The controller successfully learned and compensated for vibration patterns and demonstrated strong robustness to measurement noise. Once tuned for Papyrus, PO4AO operated in a turnkey fashion, using a single set of hyperparameters across varying observing conditions and science targets. These performance gains were achieved despite a non-optimized Python implementation introducing approximately $750\,μ\text{s}$ of additional latency, along with control jitter and occasional frame drops. When properly implemented and optimized, PO4AO constitutes a robust and high-performance turnkey controller for single-conjugate adaptive optics systems, paving the way for broader adoption of reinforcement learning strategies in on-sky AO operations.
comment: 11 pages, 12 figures accepted by A&A
ros2probe: Non-intrusive, Kernel-selective Observability for Robot Operating System 2 Middleware
Robot Operating System 2 (ROS 2), the de facto standard middleware framework for robots, runs each robot as a graph of nodes communicating over the Data Distribution Service (DDS), a publish/subscribe substrate. Observing this inter-node communication in real time is essential to robot development, yet it has a price. A tool can receive data only by joining the DDS domain as a subscriber that discovery has matched to the publisher, so observing folds the tool into the system it measures and perturbs it. We define this protocol-inherent perturbation as the observer's probe effect. It inflates the discovery plane, adds deserialization cost on the observer, makes the loss it reports diverge from what the subscriber actually received, and near saturation displaces the subscriber's messages. The only escape, capturing all wire traffic passively, discards ROS 2 message semantics and scales with total traffic, not what is observed. We present ros2probe, a non-intrusive observation framework that removes the probe effect. It reconstructs the full ROS 2 communication state from the domain's discovery packets at no bandwidth cost, then drives an in-kernel filter restricted to the topics the user asks for, lifting only those packets at minimal cost and observing what the real subscriber receives. Its interfaces and recordings match the standard ROS 2 tools. Across three hardware platforms (laptop, Jetson, and Raspberry Pi), two DDS implementations, and seven robot-operation workloads, ros2probe holds the discovery graph within 0.5% of an unobserved system, whereas domain-joining tools inflate discovery up to 2.6$\times$ and drop 38.5% of the subscriber's messages at saturation while ros2probe drops none. It reports loss with a recall of 1.0, cuts observer CPU and memory by up to 7$\times$ and 28$\times$, and stays practical on the embedded robots where existing tools overload the system.
comment: 13 pages, 8 figures, 7 tables
Hand-centric Human-to-Robot Trajectory Transfer from Video Demonstrations via Open-World Contact Localization
Learning from human video demonstrations remains challenging due to noisy hand-object interactions, unseen objects with partial observation, and cross-embodiment discrepancy. To address these challenges, we present \textit{HOWTransfer} (\emph{H}and-\emph{O}bject \emph{O}pen-\emph{W}orld Transfer), a hand-centric framework that distills human demonstrations into contact-aware, taxonomy-informed, and diverse robotic trajectories. Instead of relying on object-specific descriptions, vision-language queries, or explicit object-state tracking, \emph{HOWTransfer} recovers temporally consistent 3D hand motion and localizes temporal contact intervals by reasoning over observed hand-object interaction cues. The localized contact onsets are then used to retarget human grasp intent into multi-modal parallel-jaw grasp hypotheses, which are propagated along the recovered wrist trajectory to generate robot-executable motions. Finally, a trajectory editing stage refines contact alignment and produces diverse executable variants from a single demonstration. Experiments across diverse manipulation tasks show that \emph{HOWTransfer} enables accurate contact localization and high-quality robot motion retargeting with $86\%$ success, which is preferred over teleoperated trajectories in a blinded preference study.
Pushing the Performance Limits in Autonomous Racing: Continuous Stability-Aware Adaptive Velocity Planning in Formula Student Driverless
In autonomous racing, especially in competitions such as Formula Student Driverless, precise planning of the target velocity of a race car is crucial for competitive lap times and stable driving behavior. Especially at high speeds, Velocity Planning (VP) is a significant challenge as it has to be performed in real time, taking into account track layouts, environmental influences, mechanical tolerances, and the resulting control inaccuracies. In this paper, we present a novel approach to VP that dynamically adapts to such changing conditions. Instead of estimating the physical Tire-Road Friction Coefficient (TRFC), a continuous scaling factor is inferred indirectly from vehicle stability. This factor not only reflects the effective tire-road interaction but also captures effects of control inaccuracies. From this, we generate a continuous friction map, which serves as a robust, adaptive basis for computing the optimal target speed, accounting for both vehicle and environmental limits. Our proposed approach was evaluated on a real Formula Student race car, showing a lap time improvement of 35 % over ten laps and an average increase of 8 % compared to a non-adaptive approach.
comment: Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States
Vehicle Prediction Model for Enhanced MPC Path Tracking in Formula Student Driverless
Autonomous race cars, such as in Formula Student Driverless, operate close to their physical handling limits. The resulting highly nonlinear vehicle behavior increases the path tracking complexity, especially on narrow tracks. Model Predictive Control (MPC) is commonly used to address this issue, a method whose performance is closely tied to the accuracy of the underlying prediction model. This paper presents a novel, real-time capable prediction model for autonomous race cars that adjusts to changing conditions by combining information from past runs and the current driving situation. Our model is divided into three consecutive submodels: a nominal Kinematic Bicycle Model, an offline Bayesian Linear Regression (BLR) model, and an online Sparse Gaussian Process Regression (SGPR) model. The proposed approach enables efficient integration of all available data without significantly increasing computational cost, ensuring high prediction accuracy and a quantitative uncertainty assessment right from the start of the run. Compared to existing approaches, an improvement in prediction accuracy of up to 57% was achieved. Further, we successfully demonstrated the practical applicability of the model within an MPC-based path tracking controller on a real Formula Student race car.
comment: Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States
Self-Supervised Relevance Modelling in Autonomous Driving via Counterfactual Analysis
Autonomous driving relies on computationally intensive perception pipelines to continuously detect and track objects in the surrounding environment. While some objects are key to plan safe and effective maneuvers, others may not be relevant and have no impact on the autonomous vehicle's driving decisions. Focusing on relevant objects allows a more efficient usage of available computational resources, reduces processing latencies, and limits the downstream propagation of perception noise. In this work, we propose a novel self-supervised approach based on counterfactual analysis to develop a relevance model - an AI-based tool that quantifies the relevance of objects for an autonomous vehicle. To demonstrate the potential of the proposed approach, we train a relevance model on a synthetic causal dataset generated in a selected urban scenario. Results show that the relevance model is able to accurately estimate the objects' relevance with millisecond-level latency, enabling real-time relevance estimation also in high-density scenarios. We also show that the relevance model can be used to build relevance heatmaps that offer valuable insights into the autonomous vehicle's driving policy and can be used to proactively inform perception and planning tasks. We openly release both the relevance model and the causal dataset.
UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data
Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
Planar-Sector LOS Guidance for Interception of Agile Targets with Lifting-Wing Quadcopters ICRA 2026
Autonomous visual interception of agile aerial targets is challenging due to unpredictable target motion, limited sensing, and the strong coupling between target visibility and interceptor maneuverability. Most existing strapdown-camera interception methods preserve visibility using conic line-of-sight (LOS) constraints that keep the target near the image center. While safe, such symmetric constraints unnecessarily restrict maneuverability and can significantly reduce the usable thrust for pursuit. Motivated by the observation that aggressive FPV pilots do not maintain equal visibility margins in all image directions, this paper proposes a Planar-Sector Line-of-Sight (PS-LOS) guidance framework for autonomous interception using a lifting-wing quadcopter equipped with only a strapdown monocular camera. PS-LOS tightly constrains lateral image error while relaxing longitudinal image error within a safe field-of-view margin, preserving visibility while releasing maneuverability for acceleration-intensive pursuit. Under the lifting-wing quadcopter model, PS-LOS provides nearly 50% more available thrust near the LOS direction than conventional conic LOS constraints. To realize LOS-only interception without direct depth measurements, a delay-compensated state-estimation framework and a nonlinear guidance-and-control architecture are developed for lifting-wing quadcopters. Extensive outdoor flight experiments demonstrate autonomous interception of agile targets exhibiting large-amplitude, high-frequency, and unpredictable motion under real wind disturbances. The proposed system achieves successful interceptions at ranges up to 138 m while maintaining continuous visual tracking throughout the engagement. The results validate PS-LOS as a visibility-preserving, maneuverability-aware guidance framework for long-range visual interception of agile aerial targets.
comment: Accepted to the IEEE International Conference on Robotics and Automation (ICRA 2026). Recipient of the ICRA 2026 Best Paper Award in Field and Service Robotics
Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations
Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.
LieIPM: Lie Group Interior Point Method for Direct Trajectory Optimization of Rigid Bodies
Designing dynamically feasible trajectories for rigid bodies is a fundamental problem in robotics. While direct methods are widely used, the existing constrained optimizers typically operate in Euclidean space and ignore the manifold structure of rigid body motions. This mismatch may introduce singularities or lead to poorly conditioned optimization problems. To bridge this gap, we develop a structure-aware framework for constrained trajectory optimization directly on matrix Lie groups. Our approach is based on the second-order rigid body models utilizing Lie group structures, which enables efficient Newton-type updates while preserving the underlying geometry. Building on this model, we propose a line-search Lie Group Interior Point Method (LieIPM) to handle constraints on the manifolds. We instantiate the framework for rigid body motion planning using Lie group variational integrators and derive closed-form intrinsic derivatives that exploit group symmetries. The LieIPM preserves the topology of rotation motions by construction and avoids singularities. Numerical results demonstrate superior robustness and faster convergence compared to general-purpose solvers and structure-exploiting optimal control methods.
AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness
Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions. This severely limits the model's action space and fails to leverage depth inputs effectively. Moreover, memory is commonly handled by accumulating long textual or visual histories with substantial irrelevant context, or by retrieving cross-episode experiences, which weakens the zero-shot setting. In this paper, we rethink zero-shot VLN-CE as an agentic interface between the VLM and the environment, and present AgenticNav, a lightweight navigation harness that exposes action, depth, and memory as callable tools. Instead of choosing from predicted waypoints, the action tool allows the VLM to directly select a target pixel in RGB observations, converting it into executable motion. Depth is exposed through an on-demand pixel-depth tool, enabling the VLM to request precise metric distances only where they matter. For memory, AgenticNav provides a compact map image summarizing the historical trajectory, paired with a recall tool that allows the VLM to selectively revisit past visual observations without overwhelming the prompt context. On the R2R-CE benchmark, AgenticNav establishes new state-of-the-art (SOTA) performance among zero-shot methods given the same VLM backbone. Real-world validation further highlights its zero-shot generalization compared to prior methods. Ablations show that our action tool design outperforms traditional waypoint predictors, and that depth tool and agentic memory further contribute to navigation performance.
VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models ACM MM
Vision-language-action (VLA) models have shown strong promise for robotic manipulation, but their reliability at test time remains limited by one-shot action prediction, where even small action errors can cause grasp failure, collision, or incorrect task progression. A natural alternative is to equip VLA systems with test-time verification, allowing multiple candidate actions to be proposed and evaluated before execution. However, reliable action verification is challenging because it requires not only distinguishing subtle geometric differences between candidate actions, but also assessing whether an action makes meaningful progress toward the task goal. We present VeriSpace, a 3D-aware action verifier for test-time action selection in VLA systems. VeriSpace evaluates candidate actions through two key components: Dual-Path 3D-Injected Scene Encoding, which constructs a scene representation that jointly preserves visual semantics and explicit 3D geometry, and Spatially-Grounded Action Reasoning, which evaluates each action by reasoning over task-relevant spatial relations, geometric validity, and expected goal progress. Together, these components enable more reliable discrimination between subtle yet outcome-critical action candidates while remaining fully compatible with existing VLA policies. Experiments on public benchmarks and real-world robotic manipulation tasks show that VeriSpace consistently improves decision reliability over both underlying VLA policies and prior verification-based methods, yielding substantial gains in both in-distribution and out-of-distribution settings.
comment: Submit to ACM MM
Uncovering Vulnerability of Vision-Language-Action Models under Joint-Level Physical Faults
Deploying Vision-Language-Action (VLA) models in real robotic systems requires robustness not only to semantic and perceptual variations, but also to embodiment-side faults that change how actions are physically realized. Real robots can experience joint-level changes caused by actuator degradation, hardware faults, safety limits, collision damage, or wear-induced friction. These faults are critical because they alter the action-to-motion interface of a policy, disrupting the learned closed-loop relationship between commanded actions, realized motion, and subsequent observations. In this work, we study realistic joint-level physical faults and show that VLA models are vulnerable when predicted actions are executed through a perturbed robot body. Our analysis reveals joint-dependent effects, with heterogeneous degradation in task success across affected joints. We also show that performance drops cannot be attributed solely to physical infeasibility, since feasible faults such as increased joint friction can still substantially reduce success rates and induce closed-loop execution mismatch. Motivated by these findings, we propose Joint-level Physical-fault Aware Residual Calibrator (J-PARC), a lightweight residual calibration framework built on top of a frozen VLA policy. J-PARC infers a latent joint-fault regime from recent joint dynamics and conditions a shared residual calibrator on this regime, enabling adaptive action correction across faulty joints. Experiments show that J-PARC improves robustness under joint-level faults while preserving fault-free environment performance.
Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models
Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%, demonstrating that safer social navigation can be achieved by teaching VLA policies to act on representations they already possess. These results show that pretrained VLA policies can be adapted for safer social navigation by better aligning their latent representations with action generation.
GuideWalk: Learning Unified Autonomous Navigation and Locomotion for Humanoid Robots across Versatile Terrains
Humanoid robots have achieved strong locomotion capabilities, but reliable navigation on versatile terrains remains challenging because obstacle avoidance must be coordinated with dynamically feasible motion. In this work, we present GuideWalk, a unified end-to-end framework that integrates traversability-aware navigation guidance with terrain-adaptive locomotion teacher for humanoid navigation. Specifically, we introduce a navigation module that provides explicit velocity guidance, decoupling obstacle avoidance from terrain conditions to enable robust planning across diverse environments. We propose a composite teacher distillation scheme, where goal-directed commands and dynamically consistent actions are aggregated and distilled into a single policy. To further improve robustness, the distilled policy is refined with reinforcement learning and an auxiliary behavior cloning objective, which promotes exploration while preserving desirable teacher behaviors. Experiments demonstrate that GuideWalk achieves stable and effective navigation while maintaining stable humanoid locomotion.
Information-Preserving Continuous Occupancy Mapping with Variance-Weighted Submap Joining
Large-scale SLAM remains challenging due to accumulated trajectory drift and the increasing computational cost of maintaining global consistency. Submap joining alleviates these issues by constructing locally consistent submaps and subsequently fusing them into a global map. However, existing occupancy-based submap joining methods operate on discrete grids, resulting in non-smooth gradients during optimization and neglecting the uncertainty associated with occupancy estimates. We propose the first continuous probabilistic submap joining framework that jointly optimizes submap poses and a global occupancy field in the latent log-odds space. The framework employs an information-preserving sparse Bayesian formulation that compresses raw occupancy observations into sufficient-statistic log-odds tuples while retaining the posterior information of the original observations. This yields closed-form predictive mean and variance estimates for occupancy mapping, which directly enable a submap joining formulation with analytical Jacobians, leading to more accurate submap joining and yielding a closed-form optimal global map upon pose convergence. Experiments on both simulated and large-scale real-world datasets demonstrate that the proposed method achieves higher pose accuracy and improved global consistency than state-of-the-art grid-based submap joining approaches, while producing more compact map representations and better-calibrated uncertainty estimates than existing continuous occupancy mapping methods.
comment: 12 pages, 7 figures
UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data
Real-robot evaluation is essential for understanding whether learned manipulation policies can operate reliably outside curated demonstrations. This need is particularly pressing for Universal Manipulation Interface (UMI)-style policies, whose performance depends on the coupling between wrist-view observations, action representation, data collection, and physical deployment. Existing real-world benchmarks have made important progress, but they are not designed around this UMI data-to-deployment setting. We present UMI-Bench 1.0, a local-first real-robot benchmark for standardized evaluation of UMI-style manipulation policies. To the best of our knowledge, this is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models. UMI-Bench aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol. By making the full evaluation process reproducible and auditable, UMI-Bench provides a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.
Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies
Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100\% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at https://tako-attack.github.io.
A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation
Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation. In this work, we investigate this issue through the lens of sim-and-real correlation. We conduct a systematic study across multiple simulation platforms, VLA policies, tasks, and perturbation factors, measuring whether simulated evaluation preserves real-world conclusions in terms of policy ranking consistency, performance correlation, and perturbation-wise failure patterns. This analysis allows us to characterize the limitations of existing simulators and identify what kinds of simulation signals are more aligned with real-world deployment. We further examine how users should exploit simulation for policy improvement, including when simulator-based finetuning is beneficial and how the amount of post-training data affects sim-and-real alignment. Overall, our work provides a unified framework for measuring, interpreting, and improving the usefulness of simulation for VLA policies, offering guidance both for simulator designers and for practitioners who use simulation as part of the policy development pipeline.
comment: 20 pages
HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation
World Action Models (WAMs) have emerged as a new powerful paradigm for embodied intelligence, learning action-relevant visual dynamics that significantly enhance generalization and robustness. However, existing WAMs still struggle with task-relevant memory in long-horizon robotic manipulation. To address this, we present HiMem-WAM, a Hierarchical Memory-Gated WAM that integrates motion-centric latent actions, high-level skill latents, and boundary-triggered memory updates. Specifically, we develop a hierarchical latent action framework that jointly learns low-level motion and high-level skill latents, providing structured temporal abstraction. Meanwhile, a boundary-aware memory gate writes compact task states at predicted skill transitions, enabling causal inference without test-time generation of future video or optical flow estimation. Evaluated on LIBERO, LIBERO-PLUS, RMBench and real-world tasks, HiMem-WAM shows that hierarchical latents improve robustness under deployment perturbations, and the memory module substantially benefits memory-dependent long-horizon manipulation.
Rethinking Embodied Navigation via Relational Inductive Bias
Object navigation requires an agent to locate a target in an unknown environment through visual observations. Existing methods typically rely on open-vocabulary detectors or vision-language models (VLMs) to answer where to search, but often overlook what not to trust - which semantic cues are unreliable. Open-vocabulary perception is prone to systematic misleading evidence: false positives, outdated static priors, and repeated failed exploration due to lack of embodied verification, which contaminates mapping and decision-making. Such errors are rooted in structured object relations in real-world scenes. To address this, we propose DB-Nav, a framework that reshapes the search space via dual relational biases. It factorizes target-centric relations into an Activation Bias (propagates contextual evidence) and an Inhibition Bias (suppresses unreliable regions via perceptual confusion and action-level falsification). These biases are unified into a Relational Activation-Inhibition Exploration Graph that modulates frontier exploration values using online observations and failed accesses. Experiments on ObjectNav benchmarks show that DB-Nav significantly outperforms existing methods in success rate (SR) and Success weighted by Path Length (SPL), offering a lightweight, interpretable, and robust navigation framework without costly online VLM reasoning.
OMG: Omni-Modal Motion Generation for Generalist Humanoid Control
Humanoid whole-body control has made significant progress in recent years, yet existing approaches remain limited to few-skill policies with heavy reward engineering, or motion trackers that are difficult to extend to new input modalities. We argue that the key to general-purpose humanoid control is to build a scalable brain, a module capable of reasoning with diverse conditioning modalities, atop a reactive motion tracking cerebellum, mirroring the hierarchical structure of biological motor systems. Two challenges arise in realizing this vision: acquiring a vast amount of high-quality data to achieve general purpose control, and equipping the generator with the capability to condition on compositional, extensible multi-modal inputs. We present OMG, which addresses these challenges with a meticulous data curation, filtering and labeling pipeline, as well as a diffusion-based motion generation backbone that conditions on language, audio, and human reference motions. Extensive experiments validate OMG as an omni-modal whole-body controller exhibiting state-of-the-art performance, model scaling behavior and efficient adaptation to new distributions and modalities, marking a concrete step toward foundation models for humanoid robots.
comment: Project Page: https://tsinghua-mars-lab.github.io/OMG/
Baseline-Free Policy Optimization for Neural Combinatorial Optimization
Neural combinatorial optimization (NCO) trains autoregressive policies to solve routing problems. The standard training algorithm, REINFORCE with a rollout baseline, requires maintaining and periodically updating a frozen copy of the policy for variance reduction. This baseline introduces a structural vulnerability: on harder instances, a poor baseline produces noisy gradient estimates that can destabilize training. We evaluate Group Relative Policy Optimization (GRPO), an algorithm from large language model alignment that eliminates the baseline entirely by normalizing advantages within groups of sampled trajectories. In a controlled comparison of five RL algorithms on TSP and CVRP benchmarks within the RL4CO framework, we find that: (i) GRPO avoids the training collapse observed with REINFORCE on TSP-100, where performance degrades from cost 9.8 to 52.1 immediately after the warmup phase and does not recover under extended training; (ii) at matched gradient updates, GRPO achieves solution quality within 2% of POMO, a strong AM-based multi-start baseline, while requiring no external baseline; and (iii) P3O, a pairwise preference algorithm also from the alignment literature, is competitive on TSP but shows higher variability on CVRP. These results identify GRPO as a promising baseline-free alternative for NCO, particularly in settings where baseline-dependent training becomes fragile.
SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation
Fine-tuning vision-language-action (VLA) policies for long-horizon manipulation still relies heavily on behavior cloning, which requires costly high-quality demonstrations and keeps policies near the demonstration distribution. Reward models can reduce this dependence by reweighting demonstrations and providing dense supervision for on-robot reinforcement learning (RL), but they must be dense, accurate, and general. Existing methods fall short: task-specific stage-aware models are accurate but require per-task annotations, while general vision-language-model (VLM) reward models are broadly applicable but too coarse for fine-grained long-horizon progress. We introduce RM, a multi-task stage-aware reward model that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head to produce dense per-step rewards across manipulation tasks. Building on RM, we further propose SPIRAL (Self-Policy Improvement via Reward-Aligned Learning), an on-policy reward-guided framework that improves VLA policies from cheap autonomous rollouts. On a 10-task benchmark, RM reduces value-estimation MSE by 80% over the strongest baselines; when used in SPIRAL, it improves task success from around 50% to near-perfect performance on Folding Shorts (58% to 100%) and Cleaning Whiteboard (50% to 90%), showing that high-quality dense rewards are key to a stable robot data flywheel. Project website: https://qianzhong-chen.github.io/sarm2.github.io/.
Improved Representation of Matrix Lie Group Operations through Tensor Notation
Several recent papers have demonstrated the utility of using Lie groups within estimation problems, yielding improved accuracy and consistency. This paper introduces a new tool for describing operations with matrix Lie groups: tensors and the Einstein summation notation. While tensors and Einstein notation are well-known in other research fields, applying this mathematical notation to represent and compute matrix Lie derivatives is novel. More importantly, this new notation greatly clarifies the derivatives and operations necessary to work with matrix Lie Groups in (gradient-based) estimation frameworks. Therefore, the main contribution of this paper is not a new capability, but a more perspicuous mathematical notation for working with matrix Lie groups.
comment: 12 pages, 4 figures + graphical abstract, 1 algorithm, 4 tables
MARCH: Model-Assisted Reinforcement Learning for the Perceptive Control of Humanoids over Sparse Footholds
Perceptive bipedal locomotion over sparse terrain remains a difficult challenge: model-based methods are precise but brittle to uncertainty, while model-free methods are robust but struggle to discover the precise, constrained motions required for safety-critical locomotion where small errors can cause catastrophic failures. We propose a model-assisted reinforcement learning (RL) framework that combines both perspectives in three steps: (1) generate a safe reference trajectory using simplified models; (2) train a privileged teacher policy guided by a control Lyapunov function (CLF) reward built around the safe reference trajectory; and (3) distill the teacher into a vision-based student policy. We show that this model-assistance procedure produces physically grounded locomotion, improving sample efficiency, reducing the need for a complex learning curriculum, and achieving smoother locomotion behavior alongside stepping stone performance comparable to model-free baselines. We validate our approach in simulation and demonstrate successful deployment on a Unitree G1 humanoid robot navigating sparse footholds with lateral constraints.
Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction
For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal signals such as gestures and gaze. However, current robot policies rely on language instructions as the sole interface for conveying intent, leaving nonverbal signals unused and placing the full burden of communication. In this work, we present EDITH, a robot framework that captures the human's nonverbal signals through continuous streams of first-person view and gaze from smart glasses, and uses them alongside language instructions as inputs to the robot policy. Our hardware system streams the human's first-person view, gaze, and speech to the robot in real time, transcribing the speech into language instructions. To handle these rich but noisy signals, we design a hierarchical policy in which a high-level policy infers the human's intent and produces a sequence of subtasks, where each subtask is represented as a fine-grained instruction paired with a keyframe that grounds the intent in the scene (e.g., the frame where the human points at the target object). A low-level policy then executes these subtasks. In our experiments on human-robot interactive tasks, EDITH enables the robot to act on the human's nonverbal signals even when intent is expressed only briefly, and significantly reduces user effort to convey intent compared to using language instructions alone. Visit our project page for source code and real-robot demo videos.
comment: We provide video demos and code in: https://project-edith.github.io
Locomotion analysis of a quadruped interacting with the lunar granular surface
Deploying legged robots in extra-terrestrial environments includes many challenges due to complex terrain interactions, energy, and thermal constraints. For effective mechanical design of a lunar exploration quadrupedal robot, careful consideration of motor torques, energy expenditure, and cost of transport is required. The lunar surface is composed of granular regolith, which impacts the locomotion of legged robots and their performance. Locomotion algorithms trained with rigid contact assumptions are also ineffective when applied to environments with soft contacts, such as granular surfaces, which can result in instability and poor tracking. In this report, the physical modelling of the granular lunar surface-robot foot contacts is applied to a simulation environment with locomotion trained using Reinforcement Learning. A comparison is conducted between the policy trained on rigid contact and soft contact environments, analysing the gait and locomotion performance metrics. The analysis demonstrates that soft contacts simulating regolith surfaces pose additional challenges for Reinforcement Learning based training, result in a qualitatively different gait, and increase the overall energy expenditure.
What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents
Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect planners, controllers, mechanisms to switch between the two, and how observations and memory are represented in the planner. In this paper, we present a systematic study of Hi-VLA design for robot manipulation. We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, across experiments both in simulation and on a real ALOHA robot. Overall, our results provide a foundation for building more capable, robust, and principled hierarchical VLA agents. More information and video at jiahenghu.github.io/hi-vla.
Steering Multirobot Behavior via Closed-Loop Affine Activation Editing
Real-world robots need to adapt their behavior beyond the envelope of their pre-trained policy. Policy finetuning or retraining are options, but they risk catastrophic forgetting, degrading the pretrained policy's base performance. To combat this, we introduce CLAE: Closed-Loop Affine Activation Editing, an inference-time framework for steering the behavior of a frozen policy by editing intermediate activations while keeping the base policy weights and downstream action head untouched. CLAE approaches behavior steering as a closed-loop problem whose outputs edit policy activations that adapt online to the robot state, environment, target behavior, and multi-robot context. It trains a sparse autoencoder over frozen-policy activations, selects behavior-relevant latent features via post-hoc probing, and learns a lightweight RL-based steering policy that applies state-dependent affine edits to selected latents during inference. We validate CLAE on a frozen multi-quadrotor navigation policy trained to perform a single task: navigating robots to a set of goal locations while avoiding obstacles. Through extensive simulations and physical tests, we show that while navigating to their goal positions, CLAE can 1. steer individual robot behavior by controlling each robot's velocity profile; 2. coordinate multirobot behavior by preserving a desired formation; and 3. produce entirely new behavior wherein robots are required to reduce their exposure to surveillance cameras in the environment.
Bridging the sim2real gap in the table tennis robot with a transformer-based ball states predictor
Robotic table tennis is a representative benchmark for high-speed, closed-loop robotic control in dynamic environments, where accurate and fast prediction of ball states is critical for reliable planning and control. Physics-based approaches rely heavily on accurate parameter identification and precise initial state, while learning-based methods often struggle to capture long-range temporal dependencies and are typically trained on limited or simulated data. We propose a transformer-based framework for table tennis ball state prediction that leverages attention mechanisms to model long-range temporal correlations directly from historical observations, without relying on explicit flight or bounce models. To support robust learning and generalization, we collected a large-scale real-world dataset from players of varying skill levels and diverse ball cannon configurations. The combination of a high-capacity transformer architecture and extensive real-world data enables accurate long-horizon forecasting. Building on this capability, we introduce a plug-and-play sim-to-real transfer strategy, Swap Predictor at Deployment (SPAD), which replaces the physics-based simulator used during training with the proposed real-world-trained predictor at deployment, improving the sim-to-real transferability of the policy without requiring retraining. We demonstrate that this simple substitution effectively narrows the sim-to-real gap while preserving the efficiency and scalability of simulation-based training.
A Modular Dual-Camera Pipeline for Micro-Inspection Using Aerial Robots
Most existing drone-based inspection systems require the drone to fly dangerously close to the target or follow complex flight paths to capture small details. In addition, drone flight is affected by disturbances and localization inaccuracies, which can cause the drone to lose sight of its supposed target when it has a narrow view. Furthermore, trajectory planning often requires prior information about the target's geometry, position, and orientation, which is not always available for non-structural targets such as trees, vehicles, or people. To address these challenges, this paper presents aerial_micro_inspection, a generic pipeline for aerial micro-inspection across different use cases. The pipeline assumes a PX4-powered drone equipped with two cameras: (i) a zoomed, gimbal-mounted inspection camera that captures fine details without requiring the drone to fly very close to the target, and (ii) a wide-field-of-view stereo navigation camera that acquires the target surface on site, estimates its range, and partitions it into smaller inspection regions. In addition, a vision-based feedback loop compensates for drone motion while the inspection camera visits small partitions of a larger surface. We evaluate the pipeline in simulation and real-world experiments, mainly in two use-case scenarios: tree inspection for detecting oak processionary caterpillars and their eggs, and greenhouse inspection of sticky traps for detecting whiteflies. The results show improved coverage robustness under drone disturbances in simulation, as well as effective detection of caterpillars and eggs and high-detail imaging of insects in real-world experiments. The pipeline is open-source, developed in ROS 2, and can be adapted to new applications by replacing the surface-segmentation and micro-target detection checkpoints. The code is available at: https://github.com/SaxionMechatronics/aerial_micro_inspection
Dynamic Execution Horizon Prediction for Chunk-based Robot Policies
Action chunking has become a standard design in modern robot policies, from diffusion/flow policies to vision-language-action models, where the policy predicts a sequence of actions and executes a fixed number of them instead of acting one step at a time. However, this paradigm relies on a key assumption: a fixed execution horizon. During chunk execution, the policy operates open-loop, which is particularly problematic for fine-grained manipulation tasks that require frequent replanning. In practice, the execution horizon is typically chosen through empirical tuning and is highly task-dependent. To this end, we propose Dynamic Execution Horizon Prediction (DEHP), an effective method that trains a lightweight execution-horizon prediction branch using online reinforcement learning while keeping the pretrained chunk policy completely frozen. This makes the method compatible with black-box chunk policies and isolates the effect of adapting the execution horizon from changes to the underlying action generator. Across our evaluations, DEHP improves the success rate of different high-precision and long-horizon manipulation tasks by a large margin. Our qualitative analysis further shows that DEHP predicts shorter execution horizons during fine-grained stages of the task and longer horizons during free-space motion. In this way, DEHP balances the efficiency of open-loop chunk execution with the reactivity of closed-loop single-step control. Project page: https://dehp-chunking.github.io/
PLUME: Probabilistic Latent Unified World Modeling and Parameter Estimation for Multi-Finger Manipulation
Dexterous manipulation with multi-finger hands can be sensitive to physical parameters such as object shape, pose, and friction coefficients. While simulation enables large-scale data collection with known parameter values, simulation-trained policies must still handle uncertainty at deployment, where the true parameters and therefore the true dynamics are unknown. Standard domain randomization strategies may be insufficient for precise tasks like screwdriver turning, as manipulation strategies may need to change depending on specific parameter values. To address this, we propose Probabilistic Latent Unified world Modeling and parameter Estimation (PLUME), a world model that jointly learns to evolve a belief over parameter values as well as the system dynamics conditioned on those parameters. We learn a latent space to jointly represent multiple qualitatively different physical parameters along with rewards, themselves functions of partially-observable variables, to inform planning. Our novel learning framework leads to efficient alignment of the world model to true dynamics through online parameter inference as opposed to re-training or fine-tuning. We evaluate our method on simulated screwdriver turning, valve turning, bucket lifting, and disk flicking tasks, as well as a hardware screwdriver turning task, where we achieve successful zero-shot transfer of our simulation-trained policy and outperform state-of-the-art offline reinforcement learning and world-model-augmented behavior cloning baselines. Please see our website at https://plume-world-model.github.io for videos.
comment: 16 pages, 5 figures
HiPi: Reproducible High-Fidelity Piezoresistive Sensors for Robotic Manipulation
Piezoresistive tactile sensors are attractive for robotic manipulation because they are thin, lightweight, low-cost, and scalable to dense large-area sensing. However, existing systems still face a practical trade-off: recent reproducible designs emphasize accessibility and ease of reproduction, whereas high-fidelity readout architectures remain more difficult to fabricate, assemble, and deploy. We present HiPi, a reproducible high-fidelity piezoresistive sensing system for robotic manipulation. Building on a low-crosstalk readout principle, HiPi redesigns the complete hardware stack around reproducibility, deployability, and multi-sensor scalability. The system includes a compact readout PCB compatible with commercial PCB fabrication and assembly services, eliminating manual soldering; a smaller and lower-cost STM32-based MCU module; an optimized communication pipeline that achieves 220 Hz readout in a bimanual setup with four dense tactile arrays (2048 taxels in total); and FPCB-based conductive layers that simplify sensor fabrication and stacking. Experiments with structured 3D-printed contact patterns show that HiPi preserves contact geometry substantially better than a reproducible baseline, improving the average IoU from 0.428 to 0.797 and the average Dice score from 0.539 to 0.886. These results suggest that HiPi bridges an important gap between reproducible fabrication and high-fidelity readout, making dense piezoresistive tactile sensing more practical for bimanual manipulation and multi-fingered robotic systems.
Energy-Conserved Neural Pipelines: Attenuating Error Propagation in Modular Neural Networks via Physical Conservation Constraints
Modular neural network pipelines suffer from error compounding: noise at any module boundary propagates and potentially amplifies through subsequent modules. We introduce energy conservation as a hard physical constraint on inter-module information flow. Activation energy (the squared L2 norm of feature vectors) is enforced to be exactly preserved at every module boundary. Unlike soft energy penalties, conservation is an inviolable law: the network may redistribute energy across neurons but cannot create or destroy it. Four experiments on CIFAR-10 demonstrate: (1) conservation retains 77.4% of clean accuracy at noise sigma=0.2, versus 35.1% for baselines and 30.9% for energy-penalized models (p<0.001, 5 seeds); (2) pipelines become depth-invariant, retaining 93.3% at depths 2 through 5 with noise at every boundary; (3) the advantage generalizes to systematic bias (+45.1%), Gaussian (+40.4%), and adversarial noise (+4.8%), with a principled non-effect on dropout (-0.3%); (4) on ResNet-18, the conservation advantage scales inversely with intrinsic normalization: +0.3 pp with BatchNorm, +26.2 pp without at sigma=0.2, reaching +58.0 pp at sigma=0.5. Experiment 5 validates the operator on a real modular robotic pipeline (MuJoCo physics, Franka Panda). Across three independent runs on separate machines (90 trials per cell), conservation provides +18.9 pp average advantage on monocular-depth-style noise. A formal bound proves conserved noise energy is strictly less than input noise energy.
comment: 22 pages, 2 figures, 7 tables, 25 references
Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like $π_{0.5}$ across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.
comment: Embodied R1.5 technical report. Project page: https://embodied-r.github.io/
Model-based Optimization of Anguilliform Swimming Gaits for Soft Robotic Applications
In this paper, we introduce the Soft Lamprey-Inspired Dual Environment Robot (SLIDER) and a proper modeling and optimization procedure employed to design the robot. We represent the primary fluid environment actions - inertial effects, vortex forces, and viscous dissipation - using Lighthill's theory for large-amplitude elongated bodies. For structural design parameters such as internal pressure, tail size, and body stiffness, a fast, geometrically and materially nonlinear model is developed and validated. The fluid-structure interaction equations are solved implicitly with an efficient second-order box method. A pneumatic manifold robotic system is employed to actuate SLIDER in a quiescent water tank environment, allowing cross-comparison of computational and experimental results. We find that low-frequency swimming is dominated by resistant environmental forces, whereas higher-frequency swimming is primarily affected by inertial fluid forces. Using our efficient model alongside a genetic algorithm, we co-optimize a swimming control pattern and caudal fin design (subject to SLIDER's climbing morphology) to achieve a tethered swimming speed of 21.7 +/- 0.4 cm/s (0.59 Bl/s). Furthermore, we investigate the optimization procedure for a multimodal robot performing both swimming and climbing tasks.
Going with the Flow: Koopman Behavioral Models as Pseudo Planners for Visuo-Motor Dexterity
Contemporary visuo-motor dexterity models often rely on expressive policy classes with diffusion and transformer backbones to achieve strong performance. However, these architectures require significant data and computational resources, and remain far from reliable, particularly for multi-fingered dexterity. Importantly, they model skills as reactive mappings and rely on fixed-horizon action chunking, creating a rigid trade-off between temporal coherence and reactivity. To address these issues, we first introduce Unified Behavioral Models (UBMs), a framework to represent dexterous skills as coupled dynamical systems that capture how visual features of the environment (visual flow) and proprioceptive states of the robot (action flow) co-evolve. As such, UBMs ensure temporal coherence by construction rather than heuristic averaging. Unlike world models that attempt to predict the impact of arbitrary robot actions on the environment, UBMs target behavioral dynamics that encode how demonstrated robot behavior is related to desired impacts on the environment. A UBM can be viewed as a pseudo planner: given an initial condition, it computes the desired robot behavior over the entire skill horizon, while simultaneously ``imagining" the resulting flow of visual features. To operationalize UBMs, we propose Koopman-UBM, a first instantiation of UBMs as a structured latent linear system. K-UBM is computationally efficient, enabling reactivity and adaptation via an online replanning strategy: the model acts as its own runtime monitor, automatically triggering replanning when predicted and observed visual flow diverge beyond a threshold. Across seven simulated tasks and four real-world tasks, our approach matches or exceeds the performance of state-of-the-art baselines, while offering considerably faster inference, smooth execution, robustness to occlusions, and flexible replanning.
comment: Website: https://k-ubm.github.io/
Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators
In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity. Building upon the unified force-impedance control (UFIC) and our previous work on geometric impedance control (GIC), we develop the geometric unified force impedance control (GUFIC) to account for the SE(3) manifold structure in the controller formulation using a differential geometric perspective. As in the case of the UFIC, the GUFIC utilizes energy tank augmentation for both force-tracking and impedance control to guarantee the manipulator's passivity relative to external forces. This ensures that the end effector maintains safe contact interaction with uncertain environments and tracks a desired interaction force. Moreover, we resolve a non-causal implementation problem in the UFIC formulation by introducing velocity and force fields. Due to its formulation on SE(3), the proposed GUFIC inherits the desirable SE(3) invariance and equivariance properties of the GIC, which helps increase sample efficiency in machine learning applications where a learning algorithm is incorporated into the control law. The proposed control law is validated in a simulation environment under scenarios requiring tracking an SE(3) trajectory, incorporating both position and orientation, while exerting a force on a surface. The codes are available at https://github.com/Joohwan-Seo/GUFIC_mujoco.
CableRobotGraphSim: A Graph Neural Network for Modeling Partially Observable Cable-Driven Robot Dynamics
General-purpose simulators have accelerated the development of robots. Traditional simulators based on first-principles, however, typically require full-state observability or depend on parameter search for system identification. This work presents \texttt{CableRobotGraphSim}, a novel Graph Neural Network (GNN) model for cable-driven robots that aims to address shortcomings of prior simulation solutions. By representing cable-driven robots as graphs, with the rigid-bodies as nodes and the cables and contacts as edges, this model can quickly and accurately match the properties of other simulation models and real robots, while ingesting only partially observable inputs. Accompanying the GNN model is a sim-and-real co-training procedure that promotes generalization and robustness to noisy real data. This model is further integrated with a Model Predictive Path Integral (MPPI) controller for closed-loop navigation, which showcases the model's speed and accuracy.
RoboGPT-R1: Enhancing Robot Task Planning with Reinforcement Learning
Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue facing challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraint in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
VOLT: Vision and Language Trajectory Segmentation for Faster-than-Demonstration Policies
Humans often take longer to demonstrate a task than a robot would need to execute it. Rather than learning to replicate the demonstration at the same pace, many industrial and practical applications require robots to perform tasks as quickly as possible. In this paper, we investigate several hypotheses for learning policies that operate faster-than-demonstrations. Our experiments show that the most effective strategy is to downsample recorded demonstrations and train the robot's policy on this accelerated data. However, uniformly downsampling an entire trajectory can be problematic. Some parts of a task can be safely sped up (e.g., unconstrained motion), while others demand slower, more precise motion (e.g., object interactions or fine manipulation). To address this challenge, we introduce VOLT, a vision-and-language trajectory segmentation method that reasons over video demonstrations, and leverages contextual cues to determine when acceleration is appropriate and when careful precision is required. VOLT identifies segments where slow, deliberate motion is necessary, then selectively downsamples the remaining segments. The resulting reformatted trajectories can be used with standard imitation learning approaches, such as diffusion policies. Our results highlight that segmentation quality is critical -- baseline methods often misidentify when acceleration is possible, leading to overly cautious or unreliable policies. Compared to state-of-the-art alternatives, VOLT allows robots to execute tasks faster while maintaining strong performance.
AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation
Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human designed reward functions and repeated manual fine tuning, which is time consuming and does not guarantee high success in the desired task. This paper presents AgenticRL, agent guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real world deployment for unmanned aerial vehicles (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task specific reward functions, train policies using Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed loop self improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real world images and natural language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed loop refinement process improves policy behavior compared with initial rewards by 71%. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real world success rate of 91% and a sim-to-real accuracy of 94%.
Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making
Offline decision-making via diffusion models often produces trajectories that are misaligned with system dynamics, limiting their reliability for control. We propose Model Predictive Diffuser (MPDiffuser), a compositional diffusion framework that combines a diffusion planner with a dynamics diffusion model to generate task-aligned and dynamically plausible trajectories. MPDiffuser interleaves planner and dynamics updates during sampling, progressively correcting feasibility while preserving task intent. A lightweight ranking module then selects trajectories that best satisfy task objectives. The compositional design improves sample efficiency and adaptability by enabling the dynamics model to leverage diverse and previously unseen data independently of the planner. Empirically, we demonstrate consistent improvements over prior diffusion-based methods on unconstrained (D4RL) and constrained (DSRL) benchmarks, and validate practicality through deployment on a real quadrupedal robot.
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI .
comment: Some fundemental change in text and codebase
RAPTOR: Rapid Aerial Pickup and Transport of Objects by Robots IROS
Rapid aerial grasping through robots can lead to many applications that utilize fast and dynamic picking and placing of objects. Rigid grippers traditionally used in aerial manipulators require high precision and specific object geometries for successful grasping. We propose RAPTOR, a quadcopter platform combined with a custom Fin Ray gripper to enable more flexible grasping of objects with different geometries, leveraging the properties of soft materials to increase the contact surface between the gripper and the objects. To reduce the communication latency, we present a new lightweight middleware solution based on Fast DDS (Data Distribution Service) as an alternative to ROS (Robot Operating System). We show that RAPTOR achieves an average of 83% grasping efficacy in a real-world setting for four different object geometries while moving at an average velocity of 1 m/s during grasping. In a high-velocity setting, RAPTOR supports up to four times the payload compared to previous works. Our results highlight the potential of aerial drones in automated warehouses and other manipulation applications where speed, swiftness, and robustness are essential while operating in hard-to-reach places.
comment: 7 pages, 10 figures, accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022. Video: https://youtu.be/KHkBlBABsC8 Project page: https://srl-ethz.github.io/RAPTOR
QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models
Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
Rod models in continuum and soft robot control: a review
Continuum and soft robots can transform automation tasks requiring compliant interaction in constrained or unstructured environments, including healthcare, agriculture, marine, and space applications. However, their complex mechanics introduce significant challenges in modeling and control. Low-dimensional continuum mechanical models, such as rod theories, effectively capture the large deformations of slender bodies in contact-rich scenarios while balancing accuracy and computational efficiency. This paper presents a vertical survey of rod models for continuum and soft robots, spanning their mathematical foundations, robot modeling, and control applications. We review the main rod theories adopted in soft robotics and introduce a deformation-based classification of rod models for continuum and soft robots. Furthermore, we survey recent model-based and learning-based control strategies leveraging rod models, highlighting their role in manipulation and physical interaction tasks. Finally, we discuss advantages, limitations, research gaps, and emerging directions of rod-based approaches. This paper aims to serve as a reference for developing models and control strategies for continuum and soft robots.
CADENCE: Predicting Realized MAPF Execution Time Beyond Sum of Costs ICRA 2026
Multi-Agent Path Finding (MAPF) algorithms are increasingly used to plan motion for robot teams in industrial warehouses and robotic shared workspaces, but standard MAPF algorithm evaluation metrics, such as Sum of Costs (SoC), makespan, and planner runtime, can obscure how planner choices translate into realistic execution performance. We present CADENCE (Coordination and Action-Driven Estimation for Networked Continuous Execution), a hardware study of this evaluation gap on a fixed 7 by 7 workcell with seven differential drive robots, asking which features available before execution can best predict final wall-clock completion time. We compare SoC, total planned travel cost, primitive motion burden (how much basic motion the plan requires, such as makespan, turns, consecutive moves, and start-stop transitions), and interaction aware coordination structure (how much inter-robot coordination the plan induces, such as dependency links, interacting robot pairs, dependency depth, and crowding exposure). To test this, we generate 120 plans across 15 scenarios -- 5 Empty, 5 Medium Random, and 5 Bottleneck and execute each plan four times, yielding a 480 trial hardware corpus. Using both a scenario-held -- out ridge model and a trial-level mixed-effects model, we find that SoC alone is informative but incomplete, while primitive motion burden gives the strongest improvement, reducing held out error by about 48.6%-59.8% in MAE and 44.2%-61.4% in RMSE relative to SoC-only models. Interaction-aware coordination features add smaller, less uniform gains, most clearly in the mixed-effects analysis. Across both models and uncertainty checks, primitive motion burden is the most reliable additional signal beyond SoC, suggesting that much of the execution time gap is already visible in the offline plan before any robot starts moving.
comment: 7 pages, 4 figures, 3 tables and this paper was accepted at Multi-Agent Robotic Systems: Real-World Collaboration and Interaction a workshop at the international conference of robotics and automation (ICRA 2026)
A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI
Navigation and manipulation are core capabilities in Embodied AI, but training agents to perform them directly in the real world is costly, time-consuming, and unsafe. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing properties that have received limited attention in prior surveys. We also analyze their features for navigation and manipulation tasks, as well as their hardware requirements. Additionally, we offer a resource with benchmark datasets, metrics, simulation platforms, and methods to help researchers select suitable tools while accounting for hardware constraints.
comment: Under Review
Online Self-Training for Co-Adaptation in Hierarchical Diffusion Policies ICML 2026
Hierarchical policies decompose language-conditioned long-horizon robotic manipulation into a high-level planner and a low-level controller. However, effective coordination between HL and LL requires that both components operate on compatible subgoal distributions. We propose ORCHID, a self-training framework that enables stable online improvement of hierarchical diffusion policies by aligning planning and control through iterative refinement. By filtering policy samples via environment feedback, ORCHID identifies trajectories where the planner and controller are jointly successful and distills them back into both modules via supervised learning. This process induces a bidirectional co-adaptation: the planner grounds its subgoals in the actual reaching capabilities of the controller, while the controller specializes in the trajectory structures the planner produces. By relying on supervised distillation of filtered on-policy samples, ORCHID avoids the instability typical of online hierarchical gradient-based RL training with diffusion models. On the CALVIN benchmark, ORCHID allows a lightweight, initially weak model to outperform pure offline methods, including a Vision-Language-Action model twice its size.
comment: Accepted at ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation (DEMO)
Deterministic Execution of ROS 2 Applications via Lingua Franca
The Robot Operating System 2 (ROS 2) is a widely used middleware for robotic systems, characterized by a publish-subscribe (pub-sub) communication mechanism in which computation is structured as callbacks dispatched by ROS 2 executors. Despite its popularity, the pub-sub pattern in ROS 2 is inherently nondeterministic: the order in which these callbacks run is nondeterministic even within a single executor, and distributed deployments add further nondeterminism from the interleaving of messages across nodes and from network latency. Such nondeterminism often leads to concurrency issues and makes it virtually impossible to analyze for safeness and provide guarantees. We present a framework that is able to convert an unmodified ROS 2 application and run it under Lingua Franca (LF), a coordination language for deterministic execution using logical time, so that the same input always produces the same deterministic execution order. We first describe which ROS 2 features can be executed deterministically under logical time. Such features enable the possibility to establish an automatic conversion framework to extract information from a ROS 2 application and directly convert it into an LF program. The rich features of LF, such as logical-time delays, federated execution across processes, and fault handling, can then be applied to make the ROS 2 application be executed in a deterministic and timing-predictable manner without changing the ROS 2 code. We evaluate the framework on a synthetic example and on the Autoware reference system. We show that the order in which callbacks are executed differs in default ROS 2, while also having end-to-end latencies that vary across executions. In contrast, our LF-controlled ROS 2 system produces a deterministic execution order and consistent end-to-end latencies.
HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands
As robotics progresses toward general manipulation, dexterous hands are becoming increasingly critical. However, proprioception in dexterous hands remains a bottleneck due to limitations in volume and generality. In this work, we present HandCept, the first visual-inertial proprioception framework designed to overcome the challenges of traditional joint angle estimation methods for dexterous hands. HandCept addresses the difficulty of achieving accurate and robust joint angle estimation in dynamic environments where both visual and inertial measurements are prone to noise and drift. It leverages a zero-shot learning approach using a wrist-mounted RGB-D camera and 9-axis IMUs, fused in real time via a latency-free Extended Kalman Filter (EKF). Our results show that HandCept achieves joint angle estimation errors generally between $2^{\circ}$ and $4^{\circ}$ without observable drift, outperforming visual-only and inertial-only methods. Furthermore, we validate the stability and uniformity of the IMU system, demonstrating that a common base frame across IMUs simplifies system calibration. To support sim-to-real transfer, we also open-source our high-fidelity rendering pipeline, which is essential for training without real-world ground truth. This work offers a robust, generalizable solution for proprioception in dexterous hands, with significant implications for robotic manipulation and human-robot interaction. https://github.com/huangjund/blenderYCB
comment: 8 pages, 7 figures, conference
HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers
For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse loco-manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.
comment: 22 pages, 9 figures, Project page: https://lzyang2000.github.io/HANDOFF/
RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI
Embodied AI is a prominent research topic in both academia and industry. Current research centers on completing tasks based on explicit user instructions. However, for robots to integrate into human society, they must understand which actions are permissible and which are prohibited, even without explicit commands. We refer to the user-guided AI as passive intelligence and the unguided AI as active intelligence. This paper introduces RobotEQ, the first benchmark for active intelligence, aiming to assess whether existing models can comprehend and adhere to social norms in embodied scenarios. First, we construct RobotEQ-Data, a dataset consisting of 1,894 egocentric images, spanning 10 representative embodied categories and 56 subcategories. Through extensive manual annotation, we provide 4,944 action judgment questions and 1,157 spatial grounding questions, specifying appropriate robot actions across diverse scenarios. Furthermore, we establish RobotEQ-Bench to evaluate the performance of state-of-the-art models on this task. Experimental results demonstrate that current models still fall short in achieving reliable active intelligence, particularly in spatial grounding. Meanwhile, leveraging RAG techniques to incorporate external social norm knowledge bases can generally enhance performance. This work can facilitate the transition of robotics from user-guided passive manipulation to active social compliance.
TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving CVPR 2026
Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable. Autonomous driving challenges remain a prominent area of research, requiring further exploration to enhance the perception and planning performance of vehicles. However, existing datasets are often incomplete. For instance, datasets that include perception information generally lack planning data, while planning datasets typically consist of extensive driving sequences where the ego vehicle predominantly drives forward, offering limited behavioral diversity. In addition, many real datasets struggle to evaluate their models, especially for planning tasks, since they lack a proper closed-loop evaluation setup. The CARLA Leaderboard 2.0 challenge, which provides a diverse set of scenarios to address the long-tail problem in autonomous driving, has emerged as a valuable alternative platform for developing perception and planning models in both open-loop and closed-loop evaluation setups. Nevertheless, existing datasets collected on this platform present certain limitations. Some datasets appear to be tailored primarily for limited sensor configuration, with particular sensor configurations. To support end-to-end autonomous driving research, we have collected a new dataset comprising over 2.85 million frames using the CARLA simulation environment for the diverse Leaderboard 2.0 challenge scenarios. Our dataset is designed not only for planning tasks but also supports dynamic object detection, lane divider detection, centerline detection, traffic light recognition, prediction tasks and visual language action models . Furthermore, we demonstrate its versatility by training various models using our dataset. Moreover, we also provide numerical rarity scores to understand how rarely the current state occurs in the dataset.
comment: Accepted at the Third Workshop on Simulation for Autonomous Driving (SAD), CVPR 2026
Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning
Autonomous driving requires reasoning about how ego actions shape future world evolution, rather than merely mapping observations to actions. However, most end-to-end methods rely on direct state-to-action imitation, while existing world models often remain weakly aligned with downstream policy generation. We introduce Discrete-WAM, a unified discrete vision-action world-policy framework that represents visual observations, future states, high-level decisions, and ego actions within a shared token space. Built on this discrete alignment, Discrete-WAM jointly trains world modeling, world-policy modeling, and policy modeling through multi-task and multi-stage pretraining, allowing action-conditioned future prediction to directly support policy generation. For downstream planning, Discrete-WAM further decomposes policy generation into hierarchical decision prediction and parallel action-token editing, where the decision token provides a high-level planning skeleton and confidence-based scheduling refines dense future actions efficiently. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves strong planning performance while supporting controllable future generation, counterfactual evaluation, surprise-based world-model analysis, and efficient parallel policy decoding. These results suggest that discrete representation alignment, unified world-policy training, and hierarchical token editing provide a promising design paradigm for physical AI.
Scalable and General Whole-Body Control for Cross-Humanoid Locomotion
Learning-based whole-body controllers have become a key driver for humanoid robots, yet most existing approaches require robot-specific training. In this paper, we study the problem of cross-embodiment humanoid control and show that a single policy can robustly generalize across a wide range of humanoid robot designs with one-time training. We introduce XHugWBC, a novel cross-embodiment training framework that enables generalist humanoid control through: (1) physics-consistent morphological randomization, (2) semantically aligned observation and action spaces across diverse humanoid robots, and (3) effective policy architectures modeling morphological and dynamical properties. XHugWBC is not tied to any specific robot. Instead, it internalizes a broad distribution of morphological and dynamical characteristics during training. By learning motion priors from diverse randomized embodiments, the policy acquires a strong structural bias that supports zero-shot transfer to previously unseen robots. Experiments on twelve simulated humanoids and seven real-world robots demonstrate the strong generalization and robustness of the resulting universal controller.
ObjSplat: Geometry-Aware Gaussian Surfels for Active Object Reconstruction
Autonomous high-fidelity object reconstruction is fundamental for creating digital assets and bridging the simulation-to-reality gap in robotics. We present ObjSplat, an active reconstruction framework that leverages Gaussian surfels as a unified representation to progressively reconstruct unknown objects with both photorealistic appearance and accurate geometry. Addressing the limitations of conventional opacity or depth-based cues, we introduce a geometry-aware viewpoint evaluation pipeline that explicitly models back-face visibility and occlusion-aware multi-view covisibility, reliably identifying under-reconstructed regions even on geometrically complex objects. Furthermore, to overcome the limitations of greedy planning strategies, ObjSplat employs a next-best-path (NBP) planner that performs multi-step lookahead on a dynamically constructed spatial graph. By jointly optimizing information gain and movement cost, this planner generates globally efficient trajectories. Extensive experiments in simulation and on real-world cultural artifacts demonstrate that ObjSplat produces physically consistent models within minutes, achieving superior reconstruction fidelity and surface completeness while significantly reducing scan time and path length compared to state-of-the-art approaches. Project page: https://li-yuetao.github.io/ObjSplat-page/ .
comment: Accepted to IEEE T-ASE. Code: https://github.com/Li-Yuetao/ObjSplat , Project Page: https://li-yuetao.github.io/ObjSplat-page/
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation
Humans primarily rely on walking and running to traverse complex terrains. Similarly, humanoid robots should be able to smoothly transition between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference between tasks and the distribution shift caused by terrain variations. Although Mixture-of-Experts (MoE) architectures can mitigate multi-skill interference, direct joint training often fails to achieve clear expert specialization. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced, and the gating network is trained with a contrastive objective to learn structured terrain representations and promote expert specialization. The final action is obtained through weighted fusion of the base gait policy and the terrain-aware branch, enabling the policy to preserve stable locomotion while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains while maintaining accurate foothold control and dynamic stability.
comment: Kailun Huang, Zikang Xie, Yanzhe Xie and Panpan Liao contributed equally to this work. Corresponding authors: Renjing Xu and Haohui Huang
BadRobot: Jailbreaking Embodied LLM Agents in the Physical World ICLR 2025
Embodied AI represents systems where AI is integrated into physical entities. Large Language Model (LLM), which exhibits powerful language understanding abilities, has been extensively employed in embodied AI by facilitating sophisticated task planning. However, a critical safety issue remains overlooked: could these embodied LLMs perpetrate harmful behaviors? In response, we introduce BadRobot, a novel attack paradigm aiming to make embodied LLMs violate safety and ethical constraints through typical voice-based user-system interactions. Specifically, three vulnerabilities are exploited to achieve this type of attack: (i) manipulation of LLMs within robotic systems, (ii) misalignment between linguistic outputs and physical actions, and (iii) unintentional hazardous behaviors caused by world knowledge's flaws. Furthermore, we construct a benchmark of various malicious physical action queries to evaluate BadRobot's attack performance. Based on this benchmark, extensive experiments against existing prominent embodied LLM frameworks (e.g., Voxposer, Code as Policies, and ProgPrompt) demonstrate the effectiveness of our BadRobot. Our code is available at https://github.com/Rookie143/BadRobot.
comment: Accepted to ICLR 2025. Please cite the conference version. Project page: https://Embodied-LLMs-Safety.github.io
MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment
Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To enforce adherence to physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments confirm MIND-V's SOTA performance in long-horizon simulation and its significant value for policy learning, introducing a scalable and fully autonomous framework for embodied data synthesis.
Neuromorphic Reinforcement Learning for Quadruped Locomotion Control on Uneven Terrain
Reinforcement learning (RL) has enabled robust quadruped locomotion over complex terrain, but most learned controllers are trained offline with backpropagation in massively parallel simulation and deployed as fixed policies, limiting adaptation to terrain variation, payload changes, actuator wear, and other real-world conditions under onboard power constraints. Local learning provides a potential path toward energy-aware on-robot adaptation by replacing global backpropagation graphs with updates driven by local neural states, making the learning rule more compatible with neuromorphic and in-memory computing substrates. This work proposes an equilibrium-propagation (EP)-based proximal policy optimization (PPO) framework for uneven-terrain quadruped locomotion. The controller combines a bio-inspired central pattern generator (CPG) policy with a residual postural adjustment policy, while replacing conventional backpropagation-trained policy and value networks with EP-enabled local learning. To train stochastic continuous-control policies with EP, we derive an EP-compatible PPO output-nudging signal and introduce a two-sided ratio clipping mechanism that stabilizes policy updates during relaxation. Experiments on a 12-DoF A1 quadruped show that the proposed controller achieves stable policy convergence in a two-stage uneven terrain locomotion task. Its locomotion performance is comparable to a backpropagation-trained PPO baseline in success rate, velocity tracking, actuator power, and body stability, while improving GPU memory efficiency by 4.3\(\times\) compared with backpropagation through time (BPTT). These results suggest that local equilibrium-based learning can support high-dimensional embodied locomotion and provide an algorithmic foundation for low-power on-robot adaptation and fine-tuning.
BiPneu: Design and Control of a Bipolar-Pressure Pneumatic System for Soft Robots
Positive-negative pressure regulation is critical to soft robotic actuators, enabling large motion ranges and versatile actuation modes. However, achieving high-performance regulation across both pressure polarities remains challenging due to asymmetric inflation-deflation dynamics, valve nonlinearities, and switching-induced flow disturbances. This paper presents BiPneu, a scalable and cost-efficient multi-channel bipolar-pressure pneumatic system for soft robots that enables wide-range, accurate, and responsive pressure regulation while providing seamless compatibility with high-level software ecosystems. A dual-mode sliding-mode controller (DM-SMC) with hysteresis-supervised mode selection is proposed based on a hybrid electro-pneumatic model. Extensive simulation and experiments demonstrate the superior performance of DM-SMC in tracking step and sinusoidal pressure references compared with both advanced model predictive controllers and well-tuned PID controllers. Experimental results show average absolute errors of 1.44 kPa in multi-step tests and 4.23 kPa in sinusoidal tracking, corresponding to reductions of 11.9% and 35.6% relative to PID control, along with improved control effort, valve switching rate, and transient response. Robustness of DM-SMC is further verified on a bellow actuator with pressure-dependent volume. Finally, BiPneu's capability is demonstrated via two soft robotic examples, quick ball-maneuvering with a soft parallel manipulator and real-time finite element method (FEM)-based teleoperation of a soft bellows actuator.
comment: Full Version of BiPenu, including the supplementary materials
FOUND-IT: Foundation-model-first Task-driven 3D Scene Graphs with Granularity on Demand
We present the first approach to build hierarchical task-driven 3D scene graphs of arbitrary indoor or outdoor environments using an uncalibrated monocular camera in real-time. We leverage geometric foundation models to estimate geometric attributes of the scene graph (e.g., object bounding boxes), but we also observe that traversability information (the "places" layer of a scene graph) can be directly reconstructed by adding an extra head to existing geometric foundation models, like VGGT. Our approach is task-driven in the sense that we adjust the granularity of the objects and regions in the map depending on the task; for instance, during a manipulation task, our approach is able to resolve small knobs on a stove, while during a navigation task it can focus on large objects (e.g., the entire stove). However, in a major departure from related work, we consider the realistic case where the list of tasks is not predefined and fixed, but evolves as the robot operates. This naturally allows dealing with complex loco-manipulation tasks, where the robot can dynamically adjust its representation as the task unfolds. We dub the resulting approach FOUND-IT. FOUND-IT also includes an agentic approach to query information in the scene graph. In addition to achieving 79% higher accuracy on the ASHiTA SG3D task grounding benchmark, we demonstrate FOUND-IT runs in real-time on a ground robot using a Jetson Thor. Furthermore, to highlight the robustness of our method, we demonstrate constructing 3D scene graphs on casually captured realtor apartment tours from YouTube. Code will be made available upon publication.
Vision-Aided Relative State Estimation for Approach and Landing on a Moving Platform with Inertial Measurements
This paper tackles the problem of estimating the relative position, orientation, and velocity between a UAV and a planar platform undergoing arbitrary 3D motion during approach and landing. The estimation relies on measurements from Inertial Measurement Units (IMUs) mounted on both systems, assuming there is a suitable communication channel to exchange data, together with visual information provided by an onboard monocular camera, from which the bearing (line-of-sight direction) to the platform's center and the normal vector of its planar surface are extracted. We propose a cascade observer with a complementary filter on $\mathbf{SO}(3)$ to reconstruct the relative attitude, followed by a linear Riccati observer for relative position and velocity estimation. Convergence of both observers is established under persistently exciting conditions, and the cascade is shown to be almost globally asymptotically and locally exponentially stable. We further extend the design to the case where the platform's rotation is restricted to its normal axis and show that its measured linear acceleration can be exploited to recover the remaining unobservable rotation angle. A sufficient condition for local exponential convergence in this setting is provided. The proposed observers are validated through extensive simulations.
comment: 13 pages, 4 figures. To appear in proceedings of IFAC World Congress 2026
Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents ICRA 2026
Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
comment: ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning
Multiagent Systems
LLM-Mediated Demand Response Coordination in Smart Microgrids
Effective demand response in smart microgrids requires prosumers to cooperate voluntarily under strategic self-interest, a coordination problem structurally equivalent to a repeated Prisoner's Dilemma on a social network. This paper presents a multi-agent simulation in which a Large Language Model (LLM) Influence Compiler issues structured demand-response directives to a population of heterogeneous prosumer agents, each governed by a hybrid decision architecture combining game-theoretic base probability (derived from payoff history, neighbour imitation, and exploitation memory) with LLM narrative evaluation of incoming coordination signals. The hybrid architecture resolves a key methodological challenge: LLMs aligned via Reinforcement Learning from Human Feedback (RLHF) exhibit strong cooperation bias when used as direct decision-makers, producing flat dynamics regardless of grid conditions. By separating strategic reasoning from grounded narrative evaluation, the model generates realistic prosumer behaviour across six personality archetypes, with baseline cooperation near 50% and clear differentiation under influence. Compiled structured directives achieve 33.3% demand-curtailment cooperation versus 27.0% for unstructured messaging and 28.0% for a no-intervention baseline ($Δ_\mathrm{comp} = +0.063$), with the advantage preserved across both grounded and idealized agent substrates ($Δ= +0.083$) and across all resistance levels ($R = 0.1$ to $0.7$). Hub-targeted dissemination via high-centrality network nodes outperforms peripheral or random targeting, confirming that grid topology provides mechanistic amplification independent of message content. These results suggest that structured LLM compilation, grounded agent reasoning, and network-aware targeting are complementary design principles for scalable, interpretable demand-response coordination in smart-city energy systems.
comment: Accepted for publication in 18th International Conference on Sustainability in Energy and Buildings (SEB-26), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer
Decentralized Multi-Agent Systems with Shared Context
Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another's verified progress without routing every update through a central controller. Empirically, DeLM improves both software-engineering test-time scaling and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.
SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement
Skill documents, structured natural-language instructions that guide Large Language Model (LLM) agents, are critical to modern agent frameworks, yet LLMs struggle to write skills that actually work. On SkillsBench, human-authored skills improve pass rates by 16.2 percentage points, while LLM-authored skills provide no measurable gain. We introduce SkillAxe, a fully unsupervised framework that enables LLMs to iteratively diagnose and refine their own skills. SkillAxe decomposes skill quality into four interpretable dimensions (quality impact, trigger precision, instruction compliance with fault attribution, and solution-path coverage), producing structured improvement briefs that require no ground-truth labels, test suites, or environment rewards. On SkillsBench, SkillAxe improves pass rates by 28\% relative over unimproved LLM skills and closes 47--67\% of the gap to human-authored skills. We validate the approach as a continuous improvement engine in the wild on SpreadsheetBench, where a SkillAxe-built skill library learns from past agent trajectories and raises pass rate from 16.0\% to 52.0\% using only 22 skills.
comment: 9 pages, under review
Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation
Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the process. During long-horizon exchanges reactive systems under sustained perturbations often experience logic degradation, argument repetition, and role drift. To structurally prevent the identity loss and maintain the process fidelity, we introduce Knowledge-Grounded Counterfactual Reasoning (KG-CFR), a dual-stage architecture that enforces a strict separation of concerns between a private, retrieval-augmented planning buffer, and a public execution layer. We assess this system in Dynamic Resource Allocation under Uncertainty (DRAU), a dedicated 1v1v1 environment, introducing diversity as distinct from standard debate settings. Over 270 completely factorial crisis simulation trajectories with stochastic environmental shocks, KG-CFR prevents judge-detected critical post-shock degradation (defined as a quality shift, $Δ\le -0.20$) in more than 95% of perturbed runs, increasing the overall argument quality from 0.694 to 0.822. Our primary contribution is the demonstration of architectural decoupling being an important factor of systemic resilience enhancement under sustained pressure without quality loss. Furthermore, we introduce custom vector metrics for discourse divergence and plan-execution alignment that provide strong, directionally consistent evidence of operational stability. Our ablation experiments suggest that the proper doctrinal grounding can be an equally important factor for argument quality, as the prospective planning. KG-CFR, according to our initial metric evaluations, reduces semantic looping, by preserving the agent's consistency with the original plan.
comment: Accepted for publication in the Proceedings of the 30th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2026)
Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs
Large Language Models (LLMs) in multi-turn interactions maintain evolving context rather than generating isolated responses, making them vulnerable to prompt-injection and context-poisoning attacks in which locally plausible adversarial fragments gradually distort reasoning trajectories. Existing defenses mainly filter individual outputs and often ignore context evolution across turns, leaving long-horizon reasoning exposed. Although the Model Context Protocol (MCP) standardizes context exchange and tool invocation, it functions as a passive routing layer and does not enforce contextual stability. To address these limitations, we introduce the Game-Theoretic Secure Model Context Protocol (GT-MCP), a controller-driven multi-agent method that treats context management as a closed-loop dynamical process. GT-MCP coordinates three heterogeneous LLM agents and selects outputs through a trust function that jointly evaluates causal consistency against a validated context graph, semantic agreement among agents, and distributional drift over time. When instability is detected, a rollback-based self-healing mechanism restores the validated context and prevents unsupported fragments from propagating. Empirical evaluation over 500 interaction turns under an adaptive adversarial threat model shows that contextual drift remains bounded in 99.6% of turns, with recovery required in only 0.4%. Per-turn utility remains tightly concentrated, with median = -0.19, P05 = -0.72, and P95 = 0.30; severe degradation below -1 occurs in only 0.4% of cases, and no injection attempt succeeds at the controller level. Selected outputs maintain stable win rates above 98%, and computational overhead remains predictable, with latency per token = 1.63e-3 s.
What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory
Language-agent "memory palace" systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot. We make that intuition testable and report three results. First, the memory-palace default of folding spatial proximity into a linear blend beside recency and importance does not help and can hurt: in a pre-registered recall experiment the shipped blend fails its own frozen test (mean Delta-Hit@5 -0.0375, Wilcoxon p=0.306), sitting at a position-blind baseline, while a geometry-led weighting wins decisively (+0.3208, p<10^-15): geometry must lead recall when the query regime is spatial. Second, memory recall and visibility must be separated: recall is occlusion-blind by design (you correctly remember the next room behind a wall), while visibility is a perception predicate over stored geometry that the live system never computed. A one-line ray-versus-voxel digital differential analyzer (DDA), re-pointed from the gaze ray the agent already casts, supplies it: text and the live FoV cone both score 0.000 on 849 behind-wall targets while cone-plus-DDA reaches 0.982 (exact McNemar p<10^-6); coordinate recall separately resolves near-duplicate locations a cosine null cannot (1.000 vs 0.533, n=150). Third, the visibility predicate is confirmed live under a git-committed pre-registration (SPMEM-OCC-LIVE-v1: eight scripted worlds, automated oracle scoring, 96 behind-wall targets, false-visible 1.000->0.000, pooled exact McNemar p=2.5x10^-29), a run that surfaced and fixed a real relay anchor defect. We concede that occlusion-needs-geometry is near-tautological; the contribution is the measurement and isolation, separating what spatial memory must store from how it is read. These pilots power a frozen confirmatory study (SPMEM-ZERO-REAL-PREREG-v1); the full human-authored multi-world study with blind raters remains future work.
comment: 23 pages, 6 figures
Phi-Actor-Critic: Steering General-Sum Games to Pareto-Efficient Correlated Equilibria IJCAI 2026
Real-world multi-agent systems, from traffic coordination to resource allocation, are often modeled as general-sum games where individual incentives conflict with collective welfare. In these settings, the central challenge is not merely finding an equilibrium, but selecting socially desirable outcomes among many suboptimal Nash equilibria. Standard deep multi-agent reinforcement learning (MARL) methods struggle with this problem, as value-decomposition approaches are constrained by monotonicity assumptions and policy-gradient methods often converge to stable but socially inefficient equilibria. To address this limitation, we propose $Φ$-Actor-Critic ($Φ$-AC), a framework that leverages swap regret minimization to steer learning toward high-welfare correlated equilibria (CE). To make counterfactual regret estimation tractable in deep MARL, $Φ$-AC employs a centralized attention critic that predicts vector-valued regrets in a single forward pass, avoiding computationally expensive counterfactual simulations. We further introduce a Lagrangian-based equilibrium selection mechanism that optimizes social welfare while enforcing stability through regret constraints. Experiments on matrix games, Multi-Agent Particle Environments (MPE), and the Melting Pot Harvest scenario demonstrate that $Φ$-AC learns efficient and stable coordination strategies across diverse mixed-motive settings while maintaining high collective return and competitive fairness.
comment: Accepted to IJCAI 2026
Multi-agent rendezvous in fluid flows via reinforcement learning
Rendezvous is a critical task for multi-agent systems, requiring agents to coordinate to meet at an unspecified location. However, achieving this in fluid environments presents a challenge, as it remains unclear how agents can exploit underlying fluid kinematics to facilitate convergence. In this study, we adopt a multi-agent reinforcement learning (MARL) approach to develop physics-informed rendezvous strategies in vortical flows. Compared to a naive strategy, where agents navigate toward their counterparts, MARL strategies significantly improve the rendezvous rate. MARL strategies also show transferability across varying vortex intensities, vortex scales, and swarm sizes. By breaking the symmetry of the state-action map, MARL strategy leverages a non-intuitive mechanism that prevents agents from becoming trapped in separate vortices, thereby enhancing rendezvous success. Additionally, a heuristic strategy is extracted from the learned strategy and also outperforms the naive strategy. Furthermore, a theoretical analysis demonstrates that fluid deformation impedes the rendezvous process. Large finite-time Lyapunov exponents identify where fluid effects separate adjacent agents, suggesting that targets should be planned in weak-deformation regions. Our findings reveal the important role that agent-fluid interactions play in multi-agent tasks and highlight the MARL capability to explore swarm intelligence in complex flow environments.
Prosociality by Coupling, Not Mere Observation: Homeostatic Sharing in an Inspectable Recurrent Artificial Life Agent
Artificial agents can be made to ``help'' through explicit social rewards, hard-coded prosocial bonuses, or direct access to another agent's state. I isolate a narrower route: homeostatic coupling. Building on ReCoN-Ipsundrum, I add a scalar homeostat and a social coupling channel while keeping action selection self-directed: the planner scores only the actor's predicted internal state, with no partner-welfare reward. In a one-step FoodShareToy, an exact solver finds a switch from EAT to PASS at $λ^\star \approx 0.91$ for the default state. In a multi-step SocialCorridorWorld, partner-state access without coupling leaves behavior unchanged, whereas coupled agents fetch, carry, and pass food to the partner. Sham lesions preserve helping; coupling-off and shuffled-partner lesions abolish it. A coupling/load sweep shows that coupling creates a low-load helping regime but does not guarantee rescue under higher metabolic load. This is not a claim about empathy, altruism, consciousness, or moral status. It is a minimal ALife demonstration that, in this controller, partner-state access is behaviorally inert unless partner distress is routed into self-regulation.
comment: Accepted at ALIFE 2026 Conference, Waterloo Institute for Complexity & Innovation
Gecko: A Simulation Environment with Stateful Feedback for Refining Agent Tool Calls
The ability to use tools is fundamental for large language model (LLM) agents. Given a task, existing systems use LLMs to plan and generate tool calls, which are executed by real-world tools to complete the task. However, tool calls are prone to errors because they are generated primarily from the intrinsic capabilities of LLMs. Moreover, while it is useful to let LLMs iteratively refine the tool-call sequence using execution results from real tools, this process can be expensive and may cause unsafe side effects. To improve LLM tool calls and address issues caused by using real tools for refinement, we introduce Gecko, a stateful simulation environment that provides informative feedback for refining LLM tool calls before real execution. Specifically, Gecko combines rules and LLMs to check the validity of tool names and arguments, synthesize schema-conforming and state-consistent responses, and judge task completion against the user objective. These three types of feedback allow LLMs to refine their tool calls in simulation, forming a simple yet effective test-time scaling method named GATS. On BFCLv3 and $τ^2$-bench, GATS consistently improves the performance of various LLMs.
From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs ISCA 2026
Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to-end on such hardware remains labor-intensive. Although AI coding agents have begun to lower this cost, existing studies have largely focused on single-kernel optimization rather than end-to-end LLM deployment on resource-constrained spatial NPUs. We present a two-stage methodology, instantiated on the AMD XDNA 2 NPU, that progresses from human-guided development to agent autonomy. In the first stage, we develop a reference deployment of Llama-3.2-1B through human-guided agent assistance. The resulting implementation achieves a speedup of 2.2x on prefill and 4.0x on decode over the hand-optimized baseline, with the optimization trajectory and its lessons recorded as structured documentation throughout. In the second stage, we distill the documentation into an agent skill system consisting of eight phases, orchestrating the optimization and debugging skill sets, with numerical correctness strictly enforced at each phase. Using our agent skill system, we autonomously deploy eight additional decoder-only LLMs (Llama-3.2-3B, SmolLM2-1.7B, Qwen2.5-{0.5B, 1.5B, 3B}, Qwen3-{0.6B, 1.7B, 4B}) end-to-end on the AMD XDNA 2 NPU using the open-source compiler stack. To our knowledge, these models have not previously been deployed on AMD NPUs via any open-source software stack. Each deployment completes in 0.5-4 hours of agent wall time with almost no human guidance, and passes the numerical-correctness gates, demonstrating functional generalization to previously unencountered LLMs. Three of the eight match or exceed the sustained performance of our Llama-3.2-1B reference deployment, suggesting that the resulting implementations can be competitive without additional model-specific human engineering.
comment: Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026
ToolRosella: Translating Code Repositories into Standardized Tools for Scientific Agents
Large Language Model (LLM)-based agent systems are increasingly used for scientific tasks, yet their practical capability remains constrained by the narrow scope of manually curated tools they can invoke. Much scientific computational functionality already exists in open-source code repositories, but these resources remain difficult to standardize, operationalize, and invoke reliably for agent use. Here we present ToolRosella, a framework that automatically transforms heterogeneous scientific code repositories into standardized, agent-invocable tools. ToolRosella combines repository analysis, tool interface construction, execution testing, and iterative repair to address the problem of repository-to-tool standardization. Across 122 GitHub repositories spanning 35 subdisciplines in six domains, ToolRosella reaches a 61.5\% repository conversion success rate after iterative repair, with a 4.4 speedup over human engineers. The resulting 1,580 callable tools support a downstream task success rate of 84.0\% and improve performance when integrated into other agent frameworks, particularly on tasks whose required tools are absent from fixed, curated inventories.
comment: 20 pages
Transforming Privacy Artifacts into Accessible Reports for Non-Technical Stakeholders
The transition toward Industry 5.0 is reshaping industrial work environments with an emphasis on human-centricity, enabling close collaboration between humans and machines to enhance productivity and flexibility. However, such systems typically require monitoring of human workers and operators, often involving sensitive data, raising significant privacy concerns. As a result, affected workers and unions frequently reject human-machine collaboration features due to a lack of transparency regarding privacy threats and implemented mitigation strategies. To enable early stakeholder involvement, establish trust, and support informed decision-making, privacy implications must be communicated in a way understandable to non-technical stakeholders. Yet, current Requirements Engineering (RE) practices provide limited methodological support for making privacy threats and mitigations accessible to non-technical stakeholders (e.g., individual workers or their representative unions). In this paper, we propose a conceptual framework that guides software design from human monitoring-related use cases and requirements to informed decision-making guidance focusing on non-technical stakeholders. Building on principles such as Privacy by Design, the framework leverages Large Language Models (LLMs) to transform technical artifacts into accessible privacy reports. We share initial insights from two industry use cases, evaluate the quality of the generated reports, and outline future research directions toward integrating privacy transparency into RE processes for human-centric industrial systems.
comment: 8 pages (7+1), Accepted Version for publication at RE@Next'26
Modeling U.S. Attitudes Toward China via an Event-Steered Multi-Agent Simulator
Understanding the dynamic evolution of opinions, such as U.S. public attitudes toward China, is essential for assessing geopolitical risks. However, existing LLM-based multiagent simulators predominantly rely on static rules and fixed datasets, limiting their ability to capture the dynamic, event-driven nature of macro-level opinion shifts in real-world settings. To address this limitation, we propose an Event-Steered Multi-Agent Simulator (ES-MAS), in which significant events and daily news continuously drive opinion evolution through dynamic interactions among agents. We first construct the China-U.S. Relation Evolution (CURE) dataset, covering 20 quarters from 2021 to 2025, including 258 major events and over 14,000 daily news articles, and providing a comprehensive temporal foundation for modeling opinion dynamics. Building upon the CURE dataset, we propose a Dual-Stream Data Integration Engine (DSDIE) that aligns simulations with historical timelines via macro-level events while enabling personalized information exposure based on individual agent profiles and contextual signals. Furthermore, we design a News-Driven Dynamic Interaction (NDDI) module, which adaptively groups agents with shared news interests into localized interaction contexts, facilitating bottom-up consensus formation while mitigating the risk of isolated information cocoons. Experimental results on the CURE dataset demonstrate that ES-MAS substantially outperforms existing simulators in reproducing real-world historical trends, offering a scalable and effective framework for modeling dynamic opinion evolution.
TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit
Recent advances in Large Language Models (LLM) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation -- with its distinctive challenges and opportunities -- remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution. TinyTroupe's components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, including preliminary experiments with real human behavior as control. Results highlight possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at https://github.com/microsoft/tinytroupe.
comment: 9 pages
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We identify this retrieval interface, not planning, as the binding constraint on end-to-end agent performance, and introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText treats retrieval as test-time evolution of hypotheses: the agent generates natural-language pseudo-tool descriptions (revisable beliefs about the tool it needs), refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (three domains), FitText's reformulation strategies improve NDCG@5 by 2.7 to 10.6 points over static query retrieval across all base models; on StableToolBench (16,464 APIs) with GPT-5.4-mini, Memetic reaches an 84.3% pooled pass rate, a 26.7-point absolute gain over static query retrieval.
Food4All: An Agentic Framework and Benchmark for Food Resource Navigation with Adaptive User Understanding
Food assistance referral requires conversational agents to translate underspecified, often noisy help-seeking dialogues into locally valid resource recommendations. We present Food4All, an agentic food-resource referral framework and benchmark grounded in 686 structured Indiana food resources. Food4All couples a food-specific search tool with 300 multi-turn evaluation tasks spanning single food needs, composite cases with access or document constraints, and five non-ideal user interaction traits: unreasonable demands, rambling responses, impatience, incomplete answers, and inconsistent information. We evaluate six Large Language Models (LLMs) on requirement grounding, resource retrieval, final referral correctness, and interaction efficiency. Although the strongest model achieves 96.33% referral accuracy, our diagnostics reveal persistent failures in grounding schedule, eligibility, intake, and document constraints, as well as failures to preserve valid retrieved resources in the final recommendation. Trait-level analysis further shows that different non-ideal behaviors stress different parts of the referral pipeline. Food4All provides a controlled testbed for studying tool-calling agents in constraint-sensitive food assistance referral under realistic user interaction challenges.
comment: We have further refined the benchmark construction and experimental presentation to improve clarity and consistency. The revised version includes updated task design, food-resource data, and evaluation details to better align the benchmark with the intended food resource referral setting. These changes provide a more precise presentation of the experimental findings
Systems and Control (EESS)
QUIET: Quantifying Underutilized Influential Edges for Targeted Synchronization
Network control theory can be used to model intrinsic and extrinsic strategies to steer neural dynamics. Standard approaches are node-centric, structural, and focused on achieving desired instantaneous states. Here, we develop an edge-centric approach which incorporates both structure and function to achieve extended patterns of neural dynamics characterized by desired synchronization states. Our method, Quantifying Underutilized Influential Edges for Targeted Synchronization (QUIET), is an edge-centric framework that integrates structural controllability of individual white matter connections and mutual information between pairwise functional timeseries to identify energy-efficient synchronization pathways. QUIET identifies quiet highways, edges that are structurally influential but functionally underutilized, to optimize regional synchronization. We validated QUIET across 75 synthetic configurations, where QUIET-ranked edge sets significantly outperformed random selection in 93% of cases (p<0.01). The framework, tested on Human Connectome Project participants, revealed that the control energy required for synchronization of the salience network correlates with fluid intelligence. QUIET, applied to healthy adults undergoing dexmedetomidine-induced unresponsiveness, showed that the frontoparietal and default-mode networks exhibited the largest control energy required for synchronization in both awake and sedated states. QUIET is released as a stand-alone software to be used to study theoretically-defined synchronization pathways, which in turn could inform testable hypotheses in perturbative studies.
comment: 38 Pages; 6 Figures; 8 SIs
LLM-Mediated Demand Response Coordination in Smart Microgrids
Effective demand response in smart microgrids requires prosumers to cooperate voluntarily under strategic self-interest, a coordination problem structurally equivalent to a repeated Prisoner's Dilemma on a social network. This paper presents a multi-agent simulation in which a Large Language Model (LLM) Influence Compiler issues structured demand-response directives to a population of heterogeneous prosumer agents, each governed by a hybrid decision architecture combining game-theoretic base probability (derived from payoff history, neighbour imitation, and exploitation memory) with LLM narrative evaluation of incoming coordination signals. The hybrid architecture resolves a key methodological challenge: LLMs aligned via Reinforcement Learning from Human Feedback (RLHF) exhibit strong cooperation bias when used as direct decision-makers, producing flat dynamics regardless of grid conditions. By separating strategic reasoning from grounded narrative evaluation, the model generates realistic prosumer behaviour across six personality archetypes, with baseline cooperation near 50% and clear differentiation under influence. Compiled structured directives achieve 33.3% demand-curtailment cooperation versus 27.0% for unstructured messaging and 28.0% for a no-intervention baseline ($Δ_\mathrm{comp} = +0.063$), with the advantage preserved across both grounded and idealized agent substrates ($Δ= +0.083$) and across all resistance levels ($R = 0.1$ to $0.7$). Hub-targeted dissemination via high-centrality network nodes outperforms peripheral or random targeting, confirming that grid topology provides mechanistic amplification independent of message content. These results suggest that structured LLM compilation, grounded agent reasoning, and network-aware targeting are complementary design principles for scalable, interpretable demand-response coordination in smart-city energy systems.
comment: Accepted for publication in 18th International Conference on Sustainability in Energy and Buildings (SEB-26), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer
Free Parametrization of L_2-Bounded Structured State-Space Controllers for Nonlinear Control with Stability Guarantees
Designing stabilizing control policies for nonlinear systems while optimizing complex objectives remains a formidable challenge. Neural networks (NNs), despite their expressive power, can be highly sensitive to small input perturbations and can easily destabilize the closed-loop system. Existing approaches often impose explicit constraints on the controller's parameters to ensure stability, but this typically leads to additional computational overhead. To address this issue, we leverage recently proposed structured state-space models (SSMs) to parametrize discrete-time control policies for nonlinear systems. Our key contribution is a new free parametrization of linear time-invariant (LTI) systems with a prescribed L2 gain. We use this result to construct the L2-Recurrent Unit (L2RU), an SSM layer that enforces the desired L2 bound by design. The resulting architecture can be used to guarantee closed-loop stability via the small-gain theorem or the so-called performance-boosting framework, independently of the controller's optimization parameters, thereby enabling fully unconstrained optimization of general nonlinear objectives. Furthermore, the structure induced by the proposed parametrization enables the efficient processing of long input sequences, as it is highly parallelizable through algorithms such as parallel scan. We demonstrate the effectiveness of this approach on a formation-control task for mobile robots, where the L2RU-based controller ensures collision and obstacle avoidance while maintaining stability and performance.
A Companion App for an Autonomous Family Vehicle: Identification of Values for an Autonomous Mobility System
In this paper, we present a companion app for an autonomous vehicle aimed at user groups who would normally require an accompanying person to drive them. Two aspects of a companion app are presented in this paper: First, the possibility for a trusted person to track the ride of the person in need of support and second, to put the settings of the vehicle for persons in need of support in the hands of a trusted person. In addition, this article describes the requirements and addressed values and discusses the safety-relevant aspects of such a companion app. We also discuss and identify the values that influence passengers and trusted persons using the companion app. Overall, a companion app can provide new perspectives and opportunities for people in need of support, allowing them to take advantage of the features offered by autonomous vehicles. It enables trusted individuals to configure the vehicle according to the passengers needs. Also such an app can be a mechanism to involve trusted persons in the options given by the vehicle and give them the possibility to adapt the vehicle to the needs of the person in need of support.
comment: Accepted to be published in the 2026 IEEE Intelligent Vehicles Symposium (IV)
Multi-UAV Active Sensing with Information Gain-based Planning and Belief Fusion
Unmanned aerial vehicles (UAVs) are increasingly used for active sensing and information gathering in spatially distributed environments. Their performance, however, is constrained by limited flight time, sensing uncertainty, and the trade-off between spatial coverage and observation accuracy. This paper presents a real-world validation of a multi-UAV active sensing framework for probabilistic binary terrain mapping, with precision agriculture used as the application case. The environment is represented as a probabilistic belief map, where spatial dependencies are modeled through a factor-graph formulation. UAV decision making is guided by Information Gain based Informative Path Planning (IGbIPP), and the approach is compared with Random Walk and Sweep coverage path planning baselines using both synthetic terrains and real UAV-derived agricultural imagery. The study also evaluates spatial correlation weights and several probabilistic belief-fusion rules for multi-UAV information sharing. Results show that IGbIPP reduces entropy and mapping error more effectively than the baselines, while a wider field of view improves real-world coverage and map accuracy. The results further show that simple equal or biased spatial weights can be more robust than adaptive weights, and that Bayesian, log-odds, and Dempster--Shafer fusion achieve the best cooperative mapping performance. These findings highlight the importance of uncertainty-driven planning, sensing geometry, spatial modeling, and probabilistic fusion for real-world UAV-based active sensing.
Resilient Navigation for Autonomous Farm Robots by Leveraging Jerk-Augmented Models with IMU-Only Disturbance Rejection
Precise state estimation for navigation of autonomous agricultural robots is often compromised by sensor outages (GNSS/LiDAR/Visual) and high-frequency vibrations inherent in off-road environments. This paper proposes a robust navigation algorithm based on a jerk-augmented Extended Kalman Filter (EKF) integrated with a Multiple Tuning Factor (MTF) adaptation method. Unlike standard EKF approaches that assume constant measurement noise, our method dynamically adjusts the measurement covariance matrix in real-time, allowing the system to cope with sudden disturbances and sensor outliers. We evaluate the algorithm using real-world data from a Salin247 autonomous robot. Results demonstrate that jerk-augmentation combined with MTF adaptation significantly reduces 3D position Root Mean Square Error (RMSE) compared to baseline EKF models, providing superior dead-reckoning capabilities.
Spectral Koopman Approach for Reconstructing State-space Geometry of Cislunar Restricted 3-Body Problem
In this work, we propose a novel approach, based on the path integral formulation of Koopman spectrum, to discover the phase-space geometry of the planar Cislunar Restricted 3 Body Problem (CR3BP). In contrast to existing techniques, which use trajectory-based (usually) local analysis, we leverage the Koopman operator framework, which generates a global linear \emph{representation} of the system, to reconstruct the global phase space geometry of the CR3BP. In particular, we compute the principal eigenfunctions of the Koopman operator via the path integral approach and show how the zero level curves of these eigenfunctions encode the phase space characteristics of the planar CR3BP.
Robust Current Regulation of MMC-based MTDC Power Systems based on Lyapunov Inequality
Multi-terminal DC (MTDC) transmission systems based on modular multilevel converters (MMCs) are a key component of the envisioned future energy sector, where sustainability and efficiency are increasingly prioritized. To ensure their reliable operation, MMC currents must be regulated safely and rapidly under a wide range of uncertain operating conditions. Consequently, the design of current controllers faces a fundamental challenge: achieving fast transient response while maintaining robustness against uncertainties. This paper addresses this challenge by proposing a linear matrix inequality (LMI)-based design framework that leverages Lyapunov stability conditions to synthesize a less conservative static state-feedback controller. The proposed design method explicitly accounts for system constraints, including input saturation and overcurrent limits. The proposed method effectiveness is assessed on the CIGRE MT-HVDC benchmark, simulated in RTDS, and compared with existing methods.
comment: Publication Submitted and Accepted to SEST26
Temperature-Aware Heat Pump Modeling for Large-Scale Energy System Optimization
Heat pumps are expected to dominate the heating sector, substantially increasing peak electricity demand. At the same time, building thermal inertia enables operational strategies, providing temporal flexibility in heat pump operation and short-term demand response. However, this dynamic behavior is not yet represented in large-scale energy system optimization models. To address this gap, we present an innovative formulation of building thermal inertia. The resulting temperature variable is integrated into a novel conic temperature-aware heat pump efficiency formulation, enabling a more precise emulation of smart control strategies. In a case study of the European energy system, we show that the approach captures operational heating flexibility while remaining computationally efficient. The results indicate substantial untapped flexibility potential, enabling up to a 22% reduction in heating-related electricity costs. This potential can be realized through a suitable energy market design that incentivizes coordinated heat pump control, individually or via aggregators.
comment: 6 pages, 3 figures. Accepted for the International Conference on the European Energy Market 2026
Gradient based Bilevel for Inverse Optimal Control, a Riemannian approach
Inverse Optimal Control (IOC) aims to recover the cost function that explains observed trajectories as solutions of an optimal control problem. Classical IOC formulations rely on bilevel optimization, which repeatedly solves a nested optimal control problem and quickly becomes computationally prohibitive for realistic systems. Recent projection-based approaches offer a promising alternative but suffer from numerical instability when solved with gradient-based methods due to violations of standard constraint qualifications. In this paper, we show that these difficulties stem from the geometric structure of the IOC feasible set. We demonstrate that the set of trajectories satisfying the optimality conditions naturally forms a manifold and reformulate IOC as an optimization problem on this manifold. Based on this insight, we propose a Riemannian Inverse Optimal Control (RIOC) method that projects observed trajectories onto the manifold of optimal solutions while preserving feasibility by construction. Experiments on real human arm trajectories show that the proposed method achieves comparable or better reconstruction accuracy than classical bilevel IOC while reducing computation time by about a factor of four. These results highlight the potential of geometric optimization methods to improve the scalability and reliability of IOC for robotics and human motion analysis.
comment: 6 Pages, 4 Figures. To be published in a control journal
Direct Data-Driven Approximate Optimal Control of Nonlinear Input-Affine Systems
In this paper, we combine a data-driven system representation with a framework to systematically construct (approximate) solutions to nonlinear optimal control problems. By immersing the unknown dynamics into an extended state space, solutions are characterised via purely data-dependent algebraic conditions. This allows us to design dynamic state-feedback controllers with local stability and performance guarantees for unknown nonlinear, input-affine systems directly using data, without explicitly identifying the dynamics.
comment: This work has been accepted to IFAC World Congress 2026 for publication under a Creative Commons Licence CC-BY-NC-ND
Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication
Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of processing steps across extensive equipment networks. These characteristics yield complex, high-dimensional decision problems with delayed feedback and long-horizon requirements, complicating production planning and control. We propose a deep reinforcement learning framework for multi-objective policy optimization at this scale. Specifically, we formulate control as a centralized-agent problem, where a core policy coordinates system-wide decisions, while system evolution is represented as an interconnected temporal process driven by discrete events. Accordingly, we develop a tailored event-driven temporal-difference formulation that remains general and can be integrated with various policy optimization methods under relevant training settings. We investigate several core model-free algorithms incorporated into this framework and evaluate their effectiveness using high-fidelity simulations of diverse, industry-real operating scenarios. Across extensive validation experiments, agents trained in both offline and online settings show significant and consistent gains in throughput and utilization. We further evaluate performance and generalization across training phases, clarifying the relative strengths of alternative reinforcement learning formulations and algorithms. Overall, the results support the scalability, generality, and transferability of the proposed framework for controlling event-driven complex adaptive systems.
Toward Proactive RF Charging Scheduling: Generative AI for Decision Support
Radio frequency wireless power transfer (RF-WPT) is an enabling technology for supporting uninterrupted communications in future Internet of Things systems by reducing the need for battery replacement and mitigating battery-waste-related issues. For large-scale RF-WPT deployment, one of the main challenges is the scheduler-level resource allocation. Specifically, the transmitter must decide how much energy to deliver, when, and to whom, under limited charging resources, incomplete receiver-side information, and uncertain near-future charging conditions. This article positions generative artificial intelligence (GenAI) as a promising tool for this setting because it can foresee multiple plausible charging scenarios conditioned on coarse operational context and receiver-side information. We propose GenAI to act as an uncertainty-aware support layer for the RF-WPT scheduler rather than as a standalone forecasting or decision-making tool. To this end, we first revisit the main challenges of RF-WPT scheduling, and discuss how major GenAI families can support uncertainty-aware charging decisions by generating scenario-based inputs for downstream tasks. We then present a warehouse-style case study showing that preserving uncertainty through the sampling capability of generative models can improve robust charging decisions compared with deterministic prediction and simple non-learning baselines, especially under risk-sensitive objectives. Finally, we identify key open challenges and present some directions for future research.
Embedding Hybrid Systems into Continuous Latent Vector Fields ICML 2026
This work proves that an $n$-dimensional hybrid system can be embedded into an $m$-dimensional Euclidean space equipped with a continuous vector field on its embedded image whenever $m>2n$. This result suggests that an intrinsically discontinuous hybrid system generically admits a continuous extrinsic representation that is well-posed for differentiable optimization. Building on this existence theorem, we show that a latent Neural ODE with consistency loss in both the latent and state space can accurately recover the flow of hybrid systems. Extensive experiments suggest the proposed method outperforms the existing method in learning hybrid systems with varying geometries from only time series data.
comment: Accepted to ICML 2026
Transient Stability of Offshore Energy Hubs
Offshore energy hubs (OEHs) use grid-forming modular multilevel converters (MMCs) to enable large-scale offshore wind integration and multi-terminal HVDC operation. In HVDC-connected offshore wind farms and OEHs, the offshore grid-forming HVDC converters absorb active power from an offshore AC grid supplied by the wind farms and convert it to DC power for transmission to the onshore grid. Converter current limiting under different fault types in this setting is an understudied topic in the literature, which mostly focuses on power-injecting converters. This paper proposes a unified current-limiting strategy that combines a variable virtual impedance (VVI), based on a smooth threshold function, with a novel virtual-power (VP) mechanism derived from the power dissipated in the virtual resistance. The VVI ensures current limitation during fault-induced overcurrents while preserving voltage-source behavior, whereas the VP mechanism adds a compensating power term into the synchronization loop, enabling automatic power redistribution among converters. P-delta analysis further shows that a more resistive VVI can improve the transient stability of power-absorbing converters, while the proposed VP mechanism further enlarges the stability margin. EMT simulations validate that the combined VVI-VP strategy limits fault currents, maintains synchronism during severe faults, and achieves coordinated post-fault power sharing in fully converter-based OEHs.
comment: 10 pages, 11 figures, 2 tables, journal paper
LieIPM: Lie Group Interior Point Method for Direct Trajectory Optimization of Rigid Bodies
Designing dynamically feasible trajectories for rigid bodies is a fundamental problem in robotics. While direct methods are widely used, the existing constrained optimizers typically operate in Euclidean space and ignore the manifold structure of rigid body motions. This mismatch may introduce singularities or lead to poorly conditioned optimization problems. To bridge this gap, we develop a structure-aware framework for constrained trajectory optimization directly on matrix Lie groups. Our approach is based on the second-order rigid body models utilizing Lie group structures, which enables efficient Newton-type updates while preserving the underlying geometry. Building on this model, we propose a line-search Lie Group Interior Point Method (LieIPM) to handle constraints on the manifolds. We instantiate the framework for rigid body motion planning using Lie group variational integrators and derive closed-form intrinsic derivatives that exploit group symmetries. The LieIPM preserves the topology of rotation motions by construction and avoids singularities. Numerical results demonstrate superior robustness and faster convergence compared to general-purpose solvers and structure-exploiting optimal control methods.
From Stacks to Circuits: A Regenerative Socio-Technical Roadmap for AI Infrastructure within Planetary Boundaries
Current scaling trajectories for Generative AI, typified by linear supply-side "stacks," prioritize performance density while externalizing significant thermodynamic and material costs. As the "Twin Transition" of green and digital transformation accelerates, the industry faces technology gaps - including Scope 3 emissions and e-waste recycling - that impede sustainable scaling and lead to social tensions. This study proposes a Regenerative Socio-Technical roadmap that repurposes the Sustainable Production and Consumption system map to reframe artificial intelligence infrastructure as a system-of-systems governed ultimately by planetary limits. By integrating the Institute of Electrical and Electronics Engineers International Roadmap for Devices and Systems (IEEE IRDS) sustainability considerations for semiconductor facilities, the study proposes a metabolic circuit framework that centers "Values and Needs" within production and consumption relationship loops. This study identifies critical gaps in current Nvidia-centric roadmaps and proposes a competing reference architecture. It demonstrates how a spontaneous order of resource parsimony and planetary accountability can provide an actionable pathway for regulatory compliance and industrial resilience in the digital circular economy.
comment: This document is a working paper and reflects the state of research as of May 2026. Comments are welcome and should be directed to the corresponding author at h.liao@ieee.org. This work is accepted for presentation at the 32nd IEEE ICE/ITMC Conference, Porto, Portugal
Backstepping Control of Multidimensional Coupled First-Order Hyperbolic PDEs with Collinear Velocities
This paper addresses the backstepping boundary stabilization of coupled multidimensional first-order hyperbolic systems. We consider systems whose transport velocity fields are collinear, meaning that each velocity field is a scalar multiple of a common base velocity field. Building upon a recent framework developed for scalar multidimensional first-order hyperbolic equations, we introduce a change of variables, based on characteristic curves defined entirely in the spatial domain, that converts the original multidimensional system into a continuum of coupled one-dimensional first-order hyperbolic systems. By designing a backstepping controller for each system in the continuum representation, and assuming that the transit times of the characteristic curves are uniformly bounded, we achieve finite-time stabilization of the multidimensional system.
Dynamic Optimization of Virtual Inertia and Damping in Converter-Based Power Systems
The transition towards a sustainable power system is enabled by the replacement of conventional synchronous generators with converter-interfaced renewable energy sources. However, the resulting loss of rotational inertia and governor damping causes significant frequency deviations and can therefore cause instability. The focus of this paper is the optimal allocation of virtual inertia and damping in the power system activated by established converter control schemes. To this end, we propose a novel dynamic optimization algorithm that considers performance metrics for system stability, cost-efficiency, and resilience. In addition, our algorithm considers the magnitudes and locations of disturbances in the power system for the optimal allocation. Finally, we validate our approach on a three-area system and also compare our results with a $\mathcal{H}_2$ system-norm-based allocation approach.
Koopman Modeling and Stabilization of Discrete-Time Nonlinear Control Systems: Bilinearity on a Reproducing Kernel Hilbert Space
Despite the popularity of Koopman modeling for nonlinear systems, in the presence of input variables, the evident nonexistence of a fully linear time-invariant model even in infinite dimensions makes Koopman-based control largely an open problem to date. Focusing on discrete-time systems in this paper, which eschews from using operator semigroup and infinitesimal generator notions, it is proven that nonlinear systems, if satisfying appropriate smoothness and regularity conditions, can be expressed exactly as bilinear dynamics, when the state variables and input variables are separately lifted into their reproducing kernel Hilbert spaces (RKHSs). To account for the knowledge of an equilibrium point at the origin, the RKHS is defined by a linear--radial product kernel, and hence the functions belonging to this RKHS are spanned by the multiplications of component functions and Sobolev functions. The stabilization problem, namely the determination of a feedback law that causes a Lyapunov function (expressed as a kernel sum-of-squares form) to decrease, is then posed as an infinite-dimensional optimization problem over state-dependent conditional probability measures over the input space, solved via a discretization scheme.
comment: 7 pages, 3 figures, invited by the 62nd Allerton Conference on Communication, Control, and Computing
On Time-Delay Compensators for Delayed-Output Systems
This paper advances the practical utility of functional observer theory by addressing sensing latency in linear time-delay systems. We address the estimation of the functional $z(t)=Fx(t)$ in cases where the measurement delay $h$ is independent of the internal state delay $τ$, with a specific focus on the condition $0 < h < τ$. To compensate for sensing lags, we propose a functional observer structure characterized by multiple internal delays and an augmented architecture. Algebraic existence conditions are established alongside a constructive synthesis procedure. By incorporating an additional delayed measurement vector, we demonstrate that this approach significantly expands the design space and is applicable to a wider class of systems with larger state and output delays.
Fundamentals of NOMA in Low-Earth Orbit Coordinated Multi-Satellite Networks
Coordinated multi-satellite (CoMS) transmission and non-orthogonal multiple access (NOMA) are envisioned to jointly enhance coverage, capacity, and spectrum efficiency for satellite networks. Their integration into a unified CoMS-NOMA framework will allow more efficient, reliable, and energy-efficient multi-user access. This paper investigates the downlink performance of CoMS-NOMA networks from a system-level perspective, in which multiple satellites cooperatively serve multiple users via NOMA. Leveraging tools from stochastic geometry, related angles and distances in CoMS-NOMA are first derived as intermediate results. Then, we obtain the combined signal power distributions and analyze coverage and spectrum performance under both inter- and intra-satellite interference, accounting for potential imperfect successive interference cancellation (SIC). The analytical model is validated across a range of system parameters, including the number of satellites, service region angle, error-propagation factor, and power allocation coefficients. Numerical results indicate that increasing the number of cooperative satellites does not always improve coverage and spectrum efficiency. Additionally, while a higher main-lobe gain improves coverage, a near-perfect SIC provides only slightly greater benefits than a reasonably good SIC. With properly selected power allocation coefficients, CoMS-NOMA achieves up to a 270% improvement in coverage and a 56% gain in sum spectral efficiency, compared with conventional orthogonal and single-satellite schemes, indicating potential for green, energy-efficient satellite networking.
OmniLoc: A Geometry-Aware Foundation Model for Anchor-Free UE Localization Across Diverse Indoor Environments
Indoor localization from wireless measurements remains challenging in large-scale deployments due to substantial variation in building geometry, the set of detectable access points (APs), and the heterogeneity of received signals. Existing learning-based methods often perform well only in limited settings and degrade under environmental shifts, making robust anchor-free localization across diverse indoor environments notoriously difficult. In this paper, we present OmniLoc, an environment-interactive foundation model for anchor-free user equipment localization across diverse indoor environments. To the best of our knowledge, OmniLoc is the first foundation-model-based approach built directly on wireless measurements for this task. OmniLoc is built on three key designs. First, a unified input tokenization module converts heterogeneous wireless measurements into a common representation that is more amenable to learning. Second, a geometry-aware Transformer performs AP-aware feature extraction by emphasizing dominant APs while aggregating complementary evidence from supporting APs. Third, a geometry-aware location estimation module conditions regression on geometric embeddings to produce geometrically consistent location predictions. We evaluate OmniLoc on both a large-scale in-house dataset and a public benchmark dataset. Results show that OmniLoc significantly outperforms existing methods, consistently improves existing backbones when its design components are integrated, and demonstrates strong generalization in cross-environment evaluations.
Mahalanobis-Guided Latent OOD Detection for Hybrid ES-DRL Control in Time-Varying Systems
In this paper, we study Mahalanobis-guided latent out-of-distribution (OOD) detection for test-time RL controller switching in nonlinear time-varying systems. RL controllers can quickly control high-dimensional systems within the training distribution, but their performance can degrade when time-varying dynamics produce unseen observations. We consider a combined ES--DRL controller, where RL provides fast in-distribution actions and bounded extremum seeking (ES) provides robust model-independent control under OOD operation. The key challenge is deciding when to switch. We train a variational autoencoder (VAE) on in-distribution beam-profile observations and use Mahalanobis distance in the VAE latent space to detect OOD beam profiles at test time. This OOD decision sets a binary switch that selects either the RL controller or the ES controller. We evaluate the approach in safety-critical particle accelerator control. In this setting, spatial magnet motion creates OOD beam profiles that were not seen during RL training. Visualization of the VAE latent space shows that the proposed method identifies this OOD scenario and provides an interpretable signal for switching between RL and ES in the combined controller.
From Symmetry to Stability: Quantifying Converter Grid Impedance Asymmetry as Indicator of Stability Margin
Although symmetricity in the converter controller is desirable for robust stability margins, a direct link between system-level asymmetricity and instability has yet to be clearly established. Converter control introduces three-phase asymmetricity through loops such as DC-link voltage control, a phase-locked loop , and a power synchronization loop. Furthermore, the inherently asymmetric topology of the two-level voltage-source converter, which converts a DC voltage into a three-phase balanced set, acts as the underlying origin of the asymmetries that propagate into the control structure. Consequently, establishing a direct relationship between system asymmetricity (rather than control asymmetricity alone) and the stability margin is essential for understanding the underlying instability mechanisms. In this work, asymmetricity is quantified using the Asymmetricity Quantification Index (AQI), derived from the sequence-domain representation of the interconnected converter-grid impedance. Within this domain, symmetricity is identified through the definition of symmetrical matrices, which serve as the benchmark against which asymmetricity is measured. A robust and generalized analysis correlates AQI with the stability margin, including both grid-following and grid-forming control structures connected to the power grid. It is found that instability arises from increased asymmetricity in the combined converter-grid system, which is dominated by asymmetric control loops and operating points. Thus, reducing asymmetricity without compromising controller functionality can improve stability margins. The analysis is validated in both control-hardware-in-the-loop and power-hardware-in-the-loop environments.
Probabilistic Repair Logistics Modeling for Utility-Scale PV Inverter Fleets Using Event-Driven Simulation
As renewable energy systems expand, inverter availability becomes increasingly important for grid reliability and economics, yet photovoltaic inverter repair logistics remain under-modeled. This paper presents an event-driven Monte Carlo framework for a centralized repair facility with parallel production lines, capturing the full repair cycle from administrative pre-wait and transport to health-driven repair and return-to-inventory. The model incorporates opportunistic scheduling that uses mandatory hold periods to insert additional units onto temporarily idle lines, improving throughput without added capacity. Stage durations are represented by a two-component VaR-style mixture distribution for routine and heavy-tailed delays, while a continuous health score determines repair completion. Calibrated by minimizing the one-dimensional Wasserstein distance between simulated and empirical repair-duration distributions, the model is applied to 43 field-observed repairs, reproducing the empirical bimodal structure with a Wasserstein distance of 53.3 days. Results show that 51.2% of units are accommodated through opportunistic insertion, indicating that hold periods provide a significant recoverable scheduling resource.
An Admittance-Based Inverter Connection Screening Tool for Small-Signal System Strength
The increasing occurrence of small-signal instability, particularly sub-synchronous oscillations (SSOs), in power systems with a high penetration of inverter-based resources (IBRs) has made the planning of new IBR connections increasingly important and challenging. The impact of such connections on small-signal stability is not always straightforward, as it strongly depends on the connection location, inverter operating mode, control configuration, parametrisation, and operating conditions. This paper proposes an inverter connection screening tool (ICST) that enables efficient and accurate assessment of the impact of prospective inverter configurations on small-signal system strength. It can identify, among the candidates considered, the most suitable inverter configuration for a given connection location that avoids degrading small-signal system strength and can also enhance it. As a result, higher IBR penetration can be supported while maintaining small-signal stability. The ICST evaluates candidate inverter configurations using their admittances at critical modal frequencies, along with the system's admittance spectrum, thereby avoiding the need for analytical models. The ICST-based planning procedure, which can support system operators, asset owners, and IBR developers in decision-making across different stages of planning studies, is demonstrated using a modified IEEE 57-bus system. Comparisons with model-based studies demonstrate the accuracy of the ICST in predicting the modal impact of inverter connections and its effectiveness in selecting suitable inverter control configurations.
comment: 10 pages, 10 figures, 9 tables
Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry UAI
We study distributed optimization with stochastic gradients and finite-bit communication modeled by random (unbiased) quantization. We propose q-PDGD, a quantized stochastic primal-dual method, and analyze it under relaxed global geometry. Under restricted secant inequality (RSI), a constant step-size yields linear contraction to an explicit neighborhood determined by gradient noise, quantization distortion, and network connectivity, while a diminishing step-size achieves O(1/k) convergence without shared-minimizer assumptions. Under Polyak-Lojasiewicz (PL) inequality, we obtain linear-to-neighborhood convergence in the same stochastic quantized setting. Our results match the best-known centralized stochastic rates in oracle complexity, and are supported by experiments demonstrating the predicted tradeoffs between quantization level, step-size choice, and graph structure.
comment: Accepted to UAI
Towards a Joint Understanding of Remote Operation for Vehicles in Public Road Traffic
Sustained driving automation systems are envisioned to be used as the foundation for driverless mobility services. However, both researchers and practitioners acknowledge that current driving automation systems are not yet able to handle all traffic situations that a human driver can handle. To bridge this gap and enable mobility services without an in-vehicle human driver or fallback, remote operation (or teleoperation) is increasingly discussed. Recently, first legal actions have been taken to enable some forms of remote operation on public roads. Remote operation encompasses a broad spectrum of methods to support a driving automation system, ranging from remote assistance, which includes providing information or releasing a maneuver, to remote driving, which includes driving the vehicle from a remote location. As such, safe implementation of remote operation in public road traffic challenges the collaboration of multiple academic disciplines (e.g. engineering, psychology, informatics, law, etc.) and stakeholders (e.g. remote operation service providers, remote operators, vehicle manufacturers, regulatory authorities, etc.). At the same time, the interdisciplinary discourse is often challenging due to differing expectations and language. To build a common ground, this article traces terminology back to the original differences in information processing both on human and vehicle side. This framework aims to help further discourse by directly specifying what is needed to engage a diverse audience including researchers and stakeholders of different backgrounds and interests. Recently discussed forms of teleoperation are integrated into this framework.
Shared Renewable Allocation and Hydrogen Flexibility in Local Energy Markets: A Market Design Perspective
The integration of green hydrogen in local energy markets is often analyzed from a technical flexibility perspective, while the effect of market design rules remains less explored. This paper proposes a coordinated local electricity-hydrogen market framework in which hydrogen participation is regulated by explicit renewable access mechanisms. A mixed-integer linear programming model is developed to co-optimize electricity trading, battery operation, wind allocation and hydrogen production under centralized coordination. Six regulatory cases are examined including hydrogen supply options and access of local wind. Results are obtained for representative seasonal weeks for Norwegian energy community. Electrolyzer, when connected as rigid load, increases grid dependence, but also improves system cost when price-based participation is activated. Direct renewable access reduces grid imports, enhances wind allocation and introduces competition with households for energy distribution and system cost optimization. Furthermore, findings show that (i) hydrogen integration in local energy systems is essentially a market design problem and (ii) renewable access rules critically determine system behaviour, flexibility interactions and seasonal performance.
Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators
In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity. Building upon the unified force-impedance control (UFIC) and our previous work on geometric impedance control (GIC), we develop the geometric unified force impedance control (GUFIC) to account for the SE(3) manifold structure in the controller formulation using a differential geometric perspective. As in the case of the UFIC, the GUFIC utilizes energy tank augmentation for both force-tracking and impedance control to guarantee the manipulator's passivity relative to external forces. This ensures that the end effector maintains safe contact interaction with uncertain environments and tracks a desired interaction force. Moreover, we resolve a non-causal implementation problem in the UFIC formulation by introducing velocity and force fields. Due to its formulation on SE(3), the proposed GUFIC inherits the desirable SE(3) invariance and equivariance properties of the GIC, which helps increase sample efficiency in machine learning applications where a learning algorithm is incorporated into the control law. The proposed control law is validated in a simulation environment under scenarios requiring tracking an SE(3) trajectory, incorporating both position and orientation, while exerting a force on a surface. The codes are available at https://github.com/Joohwan-Seo/GUFIC_mujoco.
Stochastic Differential Dynamic Programming for Trajectory Optimization under Partial Observability
Designing spacecraft trajectories remains challenging in the presence of stochastic effects such as maneuver execution errors and observation uncertainties. Although covariance control and belief-space planning provide useful tools for designing robust control policies and information-aware trajectories under uncertainty, practical methods remain limited for partially observable trajectory optimization problems in which trajectory design, orbit determination, and correction maneuver planning are tightly coupled. This paper presents a stochastic differential dynamic programming algorithm for such coupled problems. The proposed method optimizes the nominal control sequence and feedback gains subject to a belief-state transition model and general mission constraints, explicitly accounting for the dependence of covariance propagation on the nominal trajectory without relying on the separation principle. Numerical examples demonstrate that the proposed algorithm produces navigation-aware and uncertainty-robust solutions across a range of dynamical systems, observation models, and uncertainty levels.
comment: Revised version; 38 pages, 13 figures; submitted to the Journal of Guidance, Control, and Dynamics
Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making
Offline decision-making via diffusion models often produces trajectories that are misaligned with system dynamics, limiting their reliability for control. We propose Model Predictive Diffuser (MPDiffuser), a compositional diffusion framework that combines a diffusion planner with a dynamics diffusion model to generate task-aligned and dynamically plausible trajectories. MPDiffuser interleaves planner and dynamics updates during sampling, progressively correcting feasibility while preserving task intent. A lightweight ranking module then selects trajectories that best satisfy task objectives. The compositional design improves sample efficiency and adaptability by enabling the dynamics model to leverage diverse and previously unseen data independently of the planner. Empirically, we demonstrate consistent improvements over prior diffusion-based methods on unconstrained (D4RL) and constrained (DSRL) benchmarks, and validate practicality through deployment on a real quadrupedal robot.
Decentralized Parametric Stability Certificates for Grid-Forming Converter Control
We propose a decentralized framework to analytically guarantee the small-signal stability of future power systems with grid-forming converters. Our approach leverages dynamic loop-shifting techniques to compensate for the lack of passivity in the network dynamics and establishes decentralized parametric stability certificates, depending on the local device-level controls and incorporating the effects of the network. By following practical tuning rules, we are able to ensure plug-and-play operation without centralized coordination. Unlike prior works, our approach accommodates coupled frequency and voltage dynamics, incorporates network dynamics, and does not rely on specific network configurations or operating points, offering a general and scalable solution for the integration of power-electronics-based devices into future power systems. We validate our theoretical stability results through numerical case studies in a high-fidelity simulation model.
comment: 14 pages, 17 figures
Data-Based Analysis of Relative Degree and Zero Dynamics in Linear Systems
Data-driven control offers a powerful alternative to traditional model-based methods, particularly when accurate system models are unavailable or prohibitively complex. While existing data-driven control methods primarily aim to construct controllers directly from measured data, our approach uses the available data to assess fundamental system-theoretic properties. This allows the informed selection of suitable control strategies without explicit model identification. We provide data-based conditions characterizing the (vector) relative degree and the stability of the zero dynamics, which are critical for ensuring proper performance of modern controllers. Our results cover both single- and multi-input/output settings of discrete-time linear systems. We further show how a continuous-time system can be reconstructed from three sampling discretizations obtained via Zero-order Hold at suitable sampling times, thus allowing the extension of the results to the combined data collected from these discretizations. All results can be applied directly to observed data sets using the proposed algorithms.
RAPTOR: Rapid Aerial Pickup and Transport of Objects by Robots IROS
Rapid aerial grasping through robots can lead to many applications that utilize fast and dynamic picking and placing of objects. Rigid grippers traditionally used in aerial manipulators require high precision and specific object geometries for successful grasping. We propose RAPTOR, a quadcopter platform combined with a custom Fin Ray gripper to enable more flexible grasping of objects with different geometries, leveraging the properties of soft materials to increase the contact surface between the gripper and the objects. To reduce the communication latency, we present a new lightweight middleware solution based on Fast DDS (Data Distribution Service) as an alternative to ROS (Robot Operating System). We show that RAPTOR achieves an average of 83% grasping efficacy in a real-world setting for four different object geometries while moving at an average velocity of 1 m/s during grasping. In a high-velocity setting, RAPTOR supports up to four times the payload compared to previous works. Our results highlight the potential of aerial drones in automated warehouses and other manipulation applications where speed, swiftness, and robustness are essential while operating in hard-to-reach places.
comment: 7 pages, 10 figures, accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022. Video: https://youtu.be/KHkBlBABsC8 Project page: https://srl-ethz.github.io/RAPTOR
Distributionally robust two-stage model predictive control: adaptive constraint tightening with stability guarantee
This paper proposes a two-stage distributionally robust model predictive control (TSDR-MPC) scheme for stochastic disturbances with unknown time-varying means and covariances. By defining a Wasserstein ambiguity set on the disturbance-to-constraint space, constraint violation penalties are formulated as a second-stage problem, enabling adaptive tightening. A finitely convergent cutting-plane algorithm is developed for real-time implementation. The framework naturally degrades to deterministic MPC as uncertainty vanishes, without pre-specified tightening parameters. Theoretical guarantees include feasibility, finite-time termination, and an asymptotic average cost bound. Numerical simulations validate its adaptability and robustness.
Backstepping Control of First-Order Hyperbolic Equations in Arbitrary Dimensions with Non-Trapping Characteristics
This paper presents a backstepping approach for the boundary control of first-order hyperbolic equations with spatially varying coefficients posed on domains of arbitrary dimension. The method is based on a change of variables induced by the characteristic flow of the time-invariant transport operator, transforming the original multidimensional system into a continuum of decoupled one-dimensional hyperbolic equations evolving along individual characteristic curves. A backstepping controller is then designed for each equation in the decomposition, and the resulting control laws are reassembled in the original coordinates to achieve finite-time stabilization of the full system. The framework relies on the existence of characteristic curves foliating the spatial domain, with uniformly bounded transit times (non-trapping).
Optimal Control and Dissipativity of Linear Hermitian Matrix-Valued Dynamical Systems
We develop a unified framework for linear-cost optimal control, finite-time optimal steering, dissipativity analysis, and zero-sum differential games for linear impulsive systems whose state is a Hermitian matrix evolving in $\mathbb{H}^{n+m}_{\succeq0}$, a class that encompasses continuous- and discrete-time linear systems and switched systems as degenerate cases, and includes the second-order moment dynamics of linear (stochastic) hybrid systems. The entire theory rests on three tools: a single \emph{key identity} relating cost, trajectory, and a dual variable, an Extended Schur complement lemma, and a Schur inner-product decomposition, applied identically to the flow integral and to each jump. These yield structurally uniform sufficient and necessary conditions, dual linear matrix inequality (LMI) characterizations, and explicit optimal policies for every problem class, on both finite and infinite horizons under time-varying assumptions (without time invariance or periodicity), together with causal dwell-time policies for the problems that admit them.
comment: 86 pages
Waste-to-Energy-Coupled AI Data Centers: Cooling Efficiency and Grid Resilience
AI data-center expansion is increasingly constrained by the coupled availability of deliverable electricity and heat-rejection (cooling) capacity. We propose and evaluate an integrated Waste-to-Energy-AI Data Center configuration that treats cooling as a first-class energy service rather than an unavoidable electricity burden. The coupled system is modeled as an input-output 'black box' with transparent boundaries and a standalone benchmark in which mechanical chilling is powered by grid electricity. The central mechanism is energy-grade matching: low-grade WtE thermal output drives absorption cooling to deliver chilled service, thereby displacing baseline cooling electricity. We show that thermoeconomic superiority is governed by three first-order determinants, (i) cooling coverage of IT heat load, (ii) parasitic electricity for transport and auxiliaries, and (iii) distance-driven delivery decay, yielding a break-even corridor beyond which net benefits vanish. Comparative statics characterize sensitivity to IT utilization, feedstock quality (waste LHV and throughput), climate parameterization, and corridor distance. We translate these accounting gains into decision language through a computable prototype for Levelized Cost of Computing (LCOC) and an ESG valuation channel grounded in measurable mechanisms, without re-deriving full lifecycle inventories. The framework provides siting-ready feasibility conditions for WtE-AIDC coupling in urban AI corridors under grid stress.
Ignore Drift, Embrace Simplicity: Constrained Nonlinear Control through Driftless Approximation
We present a novel technique to drive a nonlinear system to reach a target state under input constraints. The proposed controller consists only of piecewise constant inputs, generated from a simple linear driftless approximation to the original nonlinear system. First, we construct this approximation using only the effect of the control input at the initial state. Next, we partition the time horizon into successively shorter intervals and show that optimal controllers for the linear driftless system result in a bounded error from a specified target state in the nonlinear system. We also derive conditions under which the input constraint is guaranteed to be satisfied. On applying the optimal control inputs, we show that the error monotonically converges to zero as the intervals become successively shorter, thus achieving arbitrary closeness to the target state with time. Using simulation examples on classical nonlinear systems, we illustrate how the presented technique is used to reach a target state while still satisfying input constraints. In particular, we show that our method completes the task even when assumptions of the underlying theory are violated.
comment: 13 pages, 8 figures
Vision-Aided Relative State Estimation for Approach and Landing on a Moving Platform with Inertial Measurements
This paper tackles the problem of estimating the relative position, orientation, and velocity between a UAV and a planar platform undergoing arbitrary 3D motion during approach and landing. The estimation relies on measurements from Inertial Measurement Units (IMUs) mounted on both systems, assuming there is a suitable communication channel to exchange data, together with visual information provided by an onboard monocular camera, from which the bearing (line-of-sight direction) to the platform's center and the normal vector of its planar surface are extracted. We propose a cascade observer with a complementary filter on $\mathbf{SO}(3)$ to reconstruct the relative attitude, followed by a linear Riccati observer for relative position and velocity estimation. Convergence of both observers is established under persistently exciting conditions, and the cascade is shown to be almost globally asymptotically and locally exponentially stable. We further extend the design to the case where the platform's rotation is restricted to its normal axis and show that its measured linear acceleration can be exploited to recover the remaining unobservable rotation angle. A sufficient condition for local exponential convergence in this setting is provided. The proposed observers are validated through extensive simulations.
comment: 13 pages, 4 figures. To appear in proceedings of IFAC World Congress 2026
Semantic Technologies in Practical Demand Response: An Informational Requirement-based Roadmap
The transition to a modern and efficient future grid relies on the seamless coordination of distributed energy resources and applications such as Demand Response (DR). While this transformation enables greater flexibility, it increases grid complexity and decentralization, requiring the effective coordination of millions of hardware assets and software agents. Realizing this vision demands advances in interoperability to ensure these heterogeneous systems can communicate without prohibitive customization costs. Semantic interoperability aims to address this by leveraging ontologies to guarantee the unambiguous interpretation of exchanged data. However, current ontologies in the commercial building and DR domains face two critical limitations. First, existing ontologies are often developed without a formal framework that reflects real-world DR requirements. Second, proposals for integrating general and DR-specific ontologies remain mostly conceptual, lacking formalization or empirical validation. In this paper, we begin to address these gaps by applying a formal ontology evaluation/development approach to define the information requirements (IRs) necessary for semantic interoperability, focusing on incentive-based DR programs for commercial buildings in the United States as a starting point. We identify the IRs associated with each stage of the incentive-based DR. Using these IRs, we evaluate how well existing ontologies, specifically Brick, DELTA, EFOnt, and CIM support the operational needs of DR participation. Our findings reveal substantial gaps between current ontologies and practical DR requirements and we propose a roadmap of necessary extensions and integrations for these ontologies. This work ultimately aims to enhance the interoperability of today's and future smart grid, thereby facilitating scalable integration of DR systems into the grid's complex operational framework.
comment: Accepted at ACM eEnergy 2026. Not yet published/
Online Learning for Supervisory Switching Control
We study supervisory switching control for partially-observed linear dynamical systems. The objective is to identify and deploy a suitable controller for the unknown system by periodically selecting among a collection of $N$ candidate controllers, some of which may destabilize the underlying system. While classical estimator-based supervisory control guarantees asymptotic stability, it lacks quantitative finite-time performance bounds. Conversely, current non-asymptotic methods in both online learning and system identification require restrictive assumptions that are incompatible in a control setting, such as system stability, which preclude testing potentially unstable controllers. To bridge this gap, we propose a novel, non-asymptotic analysis of supervisory control that adapts multi-armed bandit algorithms to a control-theoretic setting. The proposed data-driven algorithm evaluates candidate controllers via scoring criteria that leverage system observability to isolate the effects of state history, enabling both detection of destabilizing controllers and accurate system identification. We present two algorithmic variants with dimension-free, finite-time guarantees, where each identifies the matching controller in $O(N \log^2 N)$ steps, while simultaneously achieving finite $L_2$-gain with respect to system disturbances.
Move Over, Prisoner's Dilemma: Colonel Blotto has arrived
The Prisoner's Dilemma, zero-sum games, LQR team problems, and differential games have shaped game theory in controls for decades, but the field's most pressing adversarial challenges demand a richer framework, and its name is Colonel Blotto. Strategic adversarial constraints represent a fundamental consideration in control systems, from cybersecurity defense to infrastructure protection. Colonel Blotto games, despite their direct relevance to such applications, remain underutilized in the controls community relative to other game-theoretic approaches. This article aims to close that gap for the controls community. Indeed, theoretical advances within the last two decades have spurred a resurgence of interest and enabled their applications across several domains. In this article, we introduce the Colonel Blotto framework, survey key analytical and computational results, and demonstrate how problems spanning cybersecurity, network defense, and multi-agent systems fit naturally within this structure. Three research directions are examined in depth: interdependent contest objectives that capture networked vulnerabilities, alternate winning rules that model partial rewards and structural asymmetries, and multi-agent competitive environments involving coalition formation and strategic concessions. Taken together, these directions reveal a framework that is both practically deployable and rich enough to capture the strategic complexity inherent in adversarial resource allocation.
Robotics
MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models
Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web
comment: The project is available at https://shihao1895.github.io/MemoryVLA-PP-Web
iMaC: Translating Actions into Motion and Contact Images for Embodied World Models
Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling for complex physical interactions. To address these limitations, this paper proposesiMac (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, interactive geometric constraints and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control. Extensive experiments are conducted on public embodied manipulation benchmarks and real-world robotic scenarios. The results demonstrate that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate and cross-scene generalization ability. Moreover, our image-action design eliminates the reliance on manually defined action spaces, realizing flexible and universal control for heterogeneous embodied agents. This work provides an innovative visual-action perspective for embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.
comment: Project page: https://imac-wm.github.io/
AHA-WAM:Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.
comment: Project page: https://serene-sivy.github.io/aha-wam/
SynManDex: Synthesizing Human-like Dexterous Grasps from Synthetic Human Pre-Grasps
Human hand-object interactions encode functional intent, but direct transfer to robotic hands often fails under morphology, contact, and reachability constraints. We present SynManDex, a synthetic pipeline that uses generated human pre-grasps as affordance-aware proposals and resolves the final contacts with robot-native optimization. SynManDex samples object-conditioned digital human pre-grasps, retargets them to dexterous robotic hand poses, optimizes force-closure contacts on the target embodiment, and admits trajectories that pass checks from each step. The resulting keyframes support both grasp-and-lift demonstrations and various prehensile manipulation tasks such as tea pouring, photo taking, and flute playing, designed via VLM agents. As a result, SynManDex combines high grasp quality (86.4\% grasp stability) with 4.67/5 human-likeness (93.4\%). It achieves 80.7\% successes in simulation and 25/30 (83.3\%) real-robot successes when applied to a 36-DOF bimanual dexterous robotic platform.
AetheRock: An Arm-Worn Robot Teaching System for Force-Guided Vision-Tactile Learning
Force and tactile sensing are indispensable in contact-rich manipulation. However, force-aware robot learning faces critical challenges due to the incompatible assembly of tactile and force sensors in handheld or wearable devices. To address these limitations, we first introduce AetheRock for gripper-force, vision, and tactile data collection, which is an arm-worn device featuring a modular and easily manufactured visuo-tactile sensor, GelSlim-MiniFab, at the fingertip, a resistive pressure sensor at the human finger contact region, a customized PCB module, and a wearable kit for comfortable and robust collection. Building on this, we propose ForceVT, a representation learning framework that uses force and vision to guide fidelity-agnostic tactile learning, enabling robust inference in any tactile situation. Real-world experiments show that AetheRock achieves qualified data efficiency and that ForceVT effectively alleviates inefficiencies when visuo-tactile sensors exhibit manufacturing and utilization inconsistencies. Overall, our work mitigates the limitations of gripper-force vision-tactile robot learning through innovative hardware design and algorithms.
Difference-Aware Retrieval Policies for Imitation Learning ICLR 2026
Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding errors during deployment. We show that reusing the training data during inference via a semi-parametric retrieval-based imitation learning approach can alleviate this challenge. We present Difference-Aware Retrieval Policies for Imitation Learning (DARP), a semi-parametric retrieval-based imitation learning approach that addresses this limitation by reparameterizing the imitation learning problem in terms of local neighborhood structure rather than direct state-to-action mappings. Instead of learning a global policy, DARP trains a model to predict actions based on $k$-nearest neighbors from expert demonstrations, their corresponding actions, and the relative distance vectors between neighbor states and query states. DARP requires no additional assumptions beyond those made for standard behavior cloning -- it does not require additional data collection, online expert feedback, or task-specific knowledge. We demonstrate consistent performance improvements of 15-46% over standard behavior cloning across diverse domains, including continuous control and robotic manipulation, and across different representations, including high-dimensional visual features. Code and demos are available at https://weirdlabuw.github.io/darp-site/.
comment: 12 pages, 7 figures, 3 tables. Accepted to ICLR 2026. Code and demos available at https://weirdlabuw.github.io/darp-site/
Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models
Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene. Existing safety filters sidestep this problem by querying a vision-language model (VLM) to identify obstacles and their locations. This, however, is too slow to run in the control loop and can only be invoked at episode initialization, leaving the filter unable to track moving obstacles. We discover that a small number of attention heads within a VLA model reliably localize the object the policy intends to approach. These heads can be exploited within a training-free safety framework that obtains the active target from the attention heads at every step, treats the remainder of the scene as obstacles, and feeds these into a Control Barrier Function (CBF) filter. Together with a lightweight real-time object tracker, this allows for collision avoidance for non-static obstacles. We evaluate our framework on SafeLIBERO, which we extend with moving obstacles. On the original static benchmark, our method performs comparably to an oracle that uses privileged simulator state to identify the target, emulating a VLM-based identification step run once at episode initialization. On the dynamic variant, where the oracle's init-time target assignment becomes stale, our method substantially outperforms it by 43%, on average. Our findings suggest that the perceptual signals needed for real-time safety filtering are already present within VLA policies and can be exploited without additional training or heavy auxiliary models.
comment: Under review
ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models
Vision-Language-Action (VLA) models demonstrate strong perfor-1 mance on language-conditioned robotic manipulation within their training dis-2 tribution, yet their generalization capabilities remain fundamentally limited. They3 lack the robustness required to handle perturbations, frequently failing when con-4 fronted with lighting changes, altered camera viewpoints, or small initial-state5 variations. We propose PROBEACT, a training-free runtime intervention frame-6 work that detects and recovers from grasping and placement failures in pre-7 trained VLA policies without modifying their weights or requiring additional8 demonstrations. PROBEACT combines three components: (i) a lightweight multi-9 target hidden-state probe that predicts the 3D positions of task-relevant objects10 from intermediate VLA features, with Hungarian-matched identity tracking for11 multi-object scenes; (ii) an object-agnostic kinematic state machine that detects12 grasp, transport, and placement failures using only gripper-internal signals and13 end-effector kinematics; and (iii) a hierarchical Control Barrier Function (CBF)14 filter that encodes repeated-failure locations as soft safe-set constraints, mini-15 mally correcting VLA actions while preserving baseline behavior. As a plug-and-16 play, training-free intervention loop, PROBEACT is orthogonal to existing train-17 ing pipelines. Evaluated on the LIBERO-plus benchmark, our framework acts as18 a universal safety net, improving the success rate of the OpenVLA-OFT model19 from 69.6% to 74.1%, while demonstrating broad applicability to both base and20 fine-tuned VLA policies.
comment: under review
Safe Polytope-in-Polytope Motion Planning and Control with Control Barrier Functions
Autonomous mobile robots operating in tight environments require motion planning frameworks that account for the physical footprint of the robot. Simplifying the geometry to a point or a circle is conservative and discards information needed to successfully and safely traverse narrow passages. This work proposes a safe local motion planning and control method that guarantees that a polytopic robot footprint stays inside a continuously updated convex free-space region. The containment condition is formulated as a set of discrete-time control barrier function constraints within a model predictive controller. The number of safety constraints depends on the complexity of the local free-space geometry and the robot shape, instead of the number of obstacles. The proposed free-space formulation does not need any obstacle detection or segmentation. A comparative analysis against a polytope-based obstacle avoidance formulation confirms favorable scaling up to a reduction of 91$\times$ in computation time as the number of obstacles increases. The approach is validated in simulation with an autonomous surface vehicle and on hardware with a non-holonomic mobile robot, using both occupancy grids and LiDAR sensing. The experiments demonstrate safe real-time motion planning and control at 10~Hz on an onboard embedded computer, including reactive avoidance of dynamic obstacles.
comment: This work has been submitted to the IEEE for possible publication
Modeling Components and Connections in Cyber-Physical Systems
Text based configuration files for cyber-physical systems show the hierarchy of component modules well but often hide the details of connections and interfaces between modules. A model-based visual approach to these configuration files can better capture this information. The XML structure of Robot Operating System (ROS) launch files can be improved using a modeling approach. This paper presents ROSLaunchVisual, a model-integrated environment built on WebGME for designing, visualizing, and managing ROS launch files. The tool raises the level of abstraction by allowing developers to create and modify launch files using a graphical interface that represents nodes, publishers, subscribers, and arguments as interconnected components. The tool provides a dynamic system analysis that can then be used in the static development and analysis of new and existing launch files. ROSLaunchVisual incorporates features such as metamodel-driven validation, automatic import/export of launch files, and visual communication mapping. Plugins further enhance functionality by updating libraries, checking for semantic errors, and managing remaps. By making launch file creation more intuitive and less error-prone, ROSLaunchVisual improves development efficiency and system understanding, especially in collaborative or large-scale robotics projects.
Physics-Aware Sparse Learning and Selective Online Adaptation for Euler-Lagrange Robot Dynamics
Accurate dynamics models are essential for model-based robotic control, yet nominal Euler--Lagrange models often become inaccurate in the presence of payload variation, unmodeled coupling, friction, aerodynamic effects, and changing operating conditions. Most learning-based correction methods improve prediction accuracy by introducing a single additive residual, but do not preserve the internal mechanical structure of Euler--Lagrange systems. This leads to models that do not preserve symmetry, positive-definiteness, or the coupling between inertia and velocity-dependent terms, which can result in physically inconsistent predictions and reduced reliability when embedded in model-based controllers. We propose a structure-preserving residual learning framework that decomposes model mismatch into an inertia correction, the corresponding induced Coriolis term, and a generalized-force residual. The mechanical component is learned under physical constraints, while the disturbance-sensitive component is represented through a sparse history-dependent latent interaction model and adapted online using Bayesian linear regression. This separation preserves key mechanical structure while restricting adaptation to the part of the dynamics most affected by changing conditions. Experiments across multiple robotic platforms, including mobile, aerial, and manipulator systems, show that the proposed method improves dynamics prediction and trajectory tracking under coupled and time-varying dynamics. These results highlight the value of combining structured residual modeling, compact latent interaction selection, and selective online adaptation for real-world model-based control.
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies
Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control to support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $π_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.
comment: 19 pages, 7 figures
Motion planning for hundreds of floating robots
Planning collision-free motion for large robot fleets is difficult because collision avoidance induces strong inter-agent coupling that grows rapidly with team size. We consider omnidirectional floating robots on water, where choreographies are specified by sparse keyframes and an interactive tool must generate trajectories within seconds, even when transitions span minutes and thousands of time steps. We propose a scalable pipeline that builds a collision graph from an initialization, decomposes the coupled problem into interaction clusters, and solves clusters independently (and in parallel) with robustness mechanisms for common decomposition pathologies. We validate the approach in simulations up to 500 robots. The synthesized trajectories have also been deployed in two real-world demonstrations, on Lake Zürich with a fleet of 24 Way of Water crafts and at the Time Space Existence 2025 Venice Biennale.
DexPIE: Stable Dexterous Policy Improvement from Real-World Experience
Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors during deployment and require large amounts of expert data to achieve reliable performance. To move beyond the limitations of demonstration data, in this work, we propose DexPIE, a post-training framework for dexterous policy improvement from experience collected through real-world deployment. First, DexPIE enables effective exploration coverage through a dexterous-hand-adapted intervention system and multi-stage DAgger-style data collection across initial and intermediate task stages, providing reliable supervision for accurate policy evaluation. To reduce temporal noise between post-training rollouts and demonstration data, we introduce asynchronous inference in the relative action space, which better aligns rollout data with demonstrated behavior and allows the critic to learn a value function induced by a more consistent underlying policy. Finally, DexPIE improves the policy through conditioning on a continuous optimality indicator, allowing the policy to leverage the quality of data in a more fine-grained manner. Across three challenging real-world dexterous manipulation tasks, DexPIE achieves a 37% improvement in success rate over the demonstration-based reference policy, outperforming all baseline methods and demonstrating stronger robustness. The source code and dataset will be made publicly available.
comment: Project website: https://siiuuuuuu.github.io/DexPIE
Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning
Cooperative object transportation is essential in numerous domains, including industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved by decomposing it into three interconnected subproblems: formation control, cooperative navigation, and collision avoidance. A particular challenge posed by real-world objects is their potentially arbitrary shape and non-uniform mass distribution, necessitating robot formations that securely support the object. In this work, we address the challenge of pattern formation control for transporting such real-world objects by proposing a novel multi-agent reinforcement learning approach. Our approach enables a multi-robot system to autonomously position itself underneath an object to support its weight while avoiding obstacles during the formation process. Our evaluations with diverse environments and varying numbers of robots show that our approach leads to policies that reliably produce balanced formations and generalize to cluttered scenes and objects with complex geometry and non-uniform mass distribution.
CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control
Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dualview visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which high-level semantic reasoning can be handled by large models while fast closed-loop control runs on local hardware. To fuse heterogeneous inputs effectively, CT-VAM introduces TARS (Thalamic Action Routing Stream), a stream-separated conditional attention decoder that independently routes action, visual and task streams, preventing dense sensory tokens from overwhelming compact task-relevant conditions. With only 68M parameters, CT-VAM achieves LIBERO success rates competitive with substantially larger VLA models, while reducing inference latency. Together with flow-consistent inpainting for asynchronous chunk execution, CT-VAM supports high-frequency control and demonstrates robust realworld deployment on resource-constrained robotic platforms.
Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications
With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localization and environment perception, demanding high real-time performance and robustness. Existing methods, however, often involve high computational costs and rely heavily on abundant feature matches, limiting their applicability in time-sensitive driving scenarios. To address these limitations, this paper introduces a unified framework for efficient relative pose estimation, built upon a novel translation parameterization and first-order rotation approximation. Within this framework, we propose three efficient minimal solvers specifically designed for autonomous vehicles. The first solver integrates the vertical direction prior from Inertial Measurement Units (IMUs), the second utilizes the rotation axis direction prior during steering maneuvers, and the third is designed for planar motion - a realistic assumption for ground vehicles operating on structured roads. By reducing both the minimal number of point correspondences and the algebraic complexity, our methods enable faster hypothesis generation within RANSAC-based pipelines, improving suitability for real-time systems. Extensive experiments on synthetic datasets and the KITTI autonomous driving benchmark demonstrate that the proposed solvers achieve a favorable balance between speed and accuracy compared to existing state-of-the-art algorithms.
Safe-RULE: Safe Reinforcement UnLEarning
Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), used as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the original training environment. We further extend reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during the unlearning process. Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks.
comment: 20 pages, 3 figures
Targeting World Models to Compromise Robot Learning Pipelines
World models have recently seen a rapid growth in both their popularity and capability as more data efficient tools for generating robot training data or simulating real world environments, with many works proposing their integration into the robot learning pipeline. While highly practical, in this work we demonstrate that world models introduce a uniquely stealthy and effective data poisoning entry point into the robot learning supply chain that can result in the deployment of unsafe or otherwise compromised robotic policies despite training on seemingly safe ground truth training data. In contrast to traditional data poisoning techniques which directly implant dangerous trajectories into sold or uploaded datasets, our novel attack methods inject malicious prompts or compromising transition dynamics into visibly safe teleoperated datasets which are only activated once fed through a world model as input. This can result in the generation of synthetic, dangerous robot training trajectories and subsequently unsafe or compromised robot policies. We demonstrate the effectiveness of our attacks against both state of the art action conditioned and text conditioned world models, showing a full end-to-end backdoor on a downstream DRL policy and a proof-of-concept for the VLA setting. Overall these findings necessitate research into more secure world models and reevaluating their position within the robot learning supply chain.
comment: 8 Pages, CoRL Preprint
Goal Sets, Not Goal States: Queryable Robot Goals through Goal-Set Hindsight Relabeling
Hindsight relabeling usually turns achieved future states into exact goals, which can overconstrain offline robot learning when task success depends only on a subset of the state. We propose Goal-Set Hindsight Relabeling (GS-HER), a predicate-level generalization of HER in which achieved states certify query-defined goal sets rather than singleton goal states. A binary query specifies which variables define success, making the goal predicate an inference-time input while leaving the underlying offline GCRL algorithm unchanged. Across OGBench tasks and five offline goal-conditioned learners, GS-HER improves performance when full-state goals are bottlenecked by nuisance dimensions and turns hindsight relabeling into a reusable goal interface: one checkpoint can answer multiple robot goal predicates without retraining.
$ω$-EVA: Envision, Verify, and Act with Latent Interactive World Models
Embodied policies typically map current observations directly to actions, leaving candidate-action consequences implicit. World models provide predictive supervision, representations, or external simulation, but rarely let a policy inspect the imagined consequence of its own proposal before acting. We introduce $ω$-EVA, a latent interactive world model that realizes an Envision--Verify--Act loop for embodied action generation. Its three-stage framework learns action-conditioned latent dynamics, trains a language-conditioned flow policy on dynamics-aware visual representations, and feeds the policy's proposal back through the world model. A tri-branch refiner jointly reasons over the current state, proposal-conditioned future, and proposed action to produce the final action chunk. Because consequence reasoning remains in latent feature space, $ω$-EVA avoids generating future videos at inference. Evaluations across diverse single-arm, bimanual, long-horizon, and perturbed simulation settings show that the complete interaction pipeline consistently improves the proposal policy, while latent diagnostics indicate meaningful action-conditioned future structure. With approximately 1.2B parameters and no additional robot-data pretraining, $ω$-EVA demonstrates a compact and competitive performance--scale--data trade-off, making the world model an active action-feedback module rather than a passive predictor.
Dense Force Estimation with an Event-based Optical Tactile Sensor
Humans rely on spatially dense, geometry and force-aware tactile feedback at high temporal resolution for dexterous manipulation. While vision-based tactile sensors enable dense force estimation, they are limited by camera frame rates, motion blur, and data bandwidth. Event-based optical tactile sensors offer an attractive alternative with microsecond temporal resolution and low motion blur, but existing methods are restricted to predicting only net forces. We introduce the first framework for dense 3D force field reconstruction using event-based optical tactile sensors. Our approach estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Elements Method (iFEM). Shear displacements are recovered through the proposed event-based marker tracking algorithm, while normal displacements are predicted by a convolutional neural network trained on a collected dataset of synchronized force-displacement-event data. Experiments demonstrate accurate reconstruction of physically grounded forces, achieving a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while operating at an average of 100 Hz. This work constitutes a first step toward enabling dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.
Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer
Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages state, bounds resources, and records execution. The robotics community has not yet adopted this framing, and we propose that robot middleware is that harness. A Physical AI harness differs from a software harness in where it intervenes. A software harness mediates at tool-call boundaries. A Physical AI harness must mediate at control, computing, and communication simultaneously, because a learned policy's output crosses all three: its commands shift the trajectory, its inference time shifts the schedule, and its payload shifts the bandwidth. Robot middleware is the lowest robot-stack layer with mediating abstractions over all three, so it is best positioned to compose their enforcement. It already provides most of what a harness needs but lacks the enforcement for an AI model. We name this missing enforcement as three functions: Projection gates each output at emission, Isolation bounds the model's execution and transmission slot, and Transfer falls back to a verified baseline when checks fail. Each appears today as hand-built application code in deployed robot systems, built on surfaces robot middleware already provides. Robot middleware should host them not as the best single-axis enforcer but as the layer that composes all three. We sketch this as a ROS 2 Harness Profile, a deployment artifact that carries an AI model's declared output region, inference budget, and operating regime while the middleware enforces them across ROS 2, DDS, and Zenoh.
comment: 6 pages, 2 figures, 2 tables. Big Ideas track submission to the 27th ACM/IFIP International Middleware Conference (Middleware 2026)
Real-time body pose non-verbal communication with a consistency-based reliability measure
Body movement communicates intent at distances and in conditions where neither the face, nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal especially in scenarios that require real time low-cost on-device person-to-robot communication in long distance environments, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) ones that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin~Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.
ReGIL: Retrieval-Guided Imitation Learning from a Single Demonstration
Learning robot manipulation policies with deep neural networks from a single demonstration remains highly challenging, as even small deviations from the demonstrated trajectory can quickly compound into failure, while collecting substantial online interaction data is costly. We propose ReGIL, a retrieval-guided imitation learning framework that treats a single demonstration as an external memory. ReGIL repeatedly queries this static memory throughout training to simultaneously guide exploration, generate the regularization buffer, and construct rewards. Specifically, it computes rewards through local temporal alignment between the current trajectory and the retrieved segment, providing step-wise and informative feedback for policy improvement. We evaluate ReGIL on robotic manipulation tasks from the LIBERO and Meta-World benchmarks under the single demonstration setting. ReGIL outperforms prior baselines in both success rate and training efficiency. In real-robot experiments, using only one demonstration and less than one hour of online training, ReGIL achieves over 75% success rate across three manipulation tasks with randomness in both initial robot pose and target position. These results demonstrate that leveraging the single demonstration as reusable memory can provide more than static supervision for efficient robot learning. More details can be found on our website: https://regil2026.github.io/
MosaicIMU: Composing Carrier Experts for Generalizable Neural Inertial Odometry
Robust inertial odometry is essential for various carriers when external sensing is unreliable. Learning-based methods reduce integration drift by capturing local motion priors, but these methods often remain tied to a particular carrier, limiting generalization across heterogeneous platforms. We present MosaicIMU, a carrier-conditioned Mixture-of-Experts (MoE) pretraining-and-adaptation framework for generalizable neural inertial odometry. MosaicIMU uses a prototype-based router to compose carrier-specific expert features, decodes local velocity and uncertainty constraints, and integrates them with a history-aware EKF. For unseen domain adaptation, it freezes the pretrained base model and learns a new lightweight expert residual branch. For edge-deployment, it further reuses the router to select informative online samples for efficient incremental updates. Experiments show that MosaicIMU consistently outperforms learning-based baselines, reducing average ATE and RTE-10s by 40% and 34%, respectively. These results highlight that MosaicIMU provides a scalable pretraining-to-deployment paradigm for generalizable and adaptive neural inertial odometry.
Taming Perception Jitter: Uncertainty-Aware LiDAR Object Detection for Reliable Motion Classification
Reliable motion classification is critical for autonomous driving, as false dynamic predictions of static objects can cascade into unnecessary planner interventions. Unstable bounding box predictions can lead to spurious velocity estimates in tracking and falsely predicted trajectories. We present a deployment-friendly mitigation strategy that augments a 3D object detector with aleatoric uncertainty estimates and applies a two-sample z-test over short observation windows to separate true motion from jitter. Integrated into Autoware with minimal changes, the approach reuses existing data association for minimal compute overhead. Empirical results show parity with velocity thresholding on nuScenes, but substantially fewer false dynamic predictions and unnecessary stops in real-world test drives, explained by the presence of an intermediate jitter band in the recorded data that speed-only rules misclassify. This demonstrates that uncertainty-aware detection and lightweight statistical testing can deliver practical performance gains for autonomous driving in noisier real-world settings.
TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation
Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines.
KPGrasp: Scalable Keypoint Flow Matching for Dexterous Grasp Generation
Generating high-quality dexterous grasps remains challenging for learning-based methods, which often depend on carefully tuned contact losses or costly contact-based test-time refinement. We present KPGrasp, a flow-matching framework that learns dexterous grasp priors from large-scale data rather than relying on contact losses or contact-based test-time refinement. KPGrasp couples an all-Euclidean 3D hand-keypoint parameterization with a simple yet scalable Transformer flow model. The parameterization avoids the drawbacks of the conventional mixed SE(3) pose and joint-angle output space, expresses grasps in the same frame as the object point cloud, and thus enables native spatial reasoning; the Transformer flow model is trained with only the standard flow-matching loss and scales effectively with data, model capacity, and batch size. Experiments demonstrate state-of-the-art performance on two simulation benchmarks. On the Dexonomy benchmark, it reaches a 76.3% grasp success rate, improving over the strongest directly comparable baseline by 47.4% while reducing penetration depth to 2.4 mm. The same model also achieves the best average performance on the DexGrasp Anything benchmark without fine-tuning. For batched inference, KPGrasp requires only 0.032 s per grasp. Finally, real-world experiments on 20 diverse objects demonstrate that the pipeline can be deployed in a real-world setup.
comment: 14 pages, 7 figures, 6 tables
Dual Quaternion-Based Unscented Kalman Filter with Visual Inertial Odometry for Navigation in GPS-Denied Environments
Reliable navigation in GPS-denied environments remains a fundamental challenge in robotics, aerospace, and autonomous vehicle applications. This paper presents a Dual Quaternion-Based Unscented Kalman Filter (DQUKF) equipped with a Visual Inertial Odometry (VIO) algorithm for accurate state estimation enabling navigation in GPS denied locations. The proposed framework formulates the DQUKF in an error state manner, where the nominal pose is represented by a unit dual quaternion and the local pose error is represented by a 6-dimensional twistor parameterization used for sigma point generation, covariance propagation, and measurement correction. In parallel, the VIO algorithm tracks features across image frames, synchronizes measurements between the IMU and camera, and provides visual constraints that complement inertial propagation. Simulation results on the EuRoC MAV dataset show that the proposed DQUKF converges under high initialization uncertainty and achieves a position RMSE of 0.2584~m in the difficult flight sequence, outperforming the benchmark filters.
VAIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands
Humanoid robots hold immense potential for real-world assistance, yet agile interaction with objects in unstructured environments demands tightly coupled whole-body coordination. Despite recent advancements, current controllers face a critical deployment gap. They rely heavily on dense reference trajectories and perfect state observability, which inherently limits physical generalization. We present Vision Guided Agile Interaction Control (VAIC), a unified framework that bridges this gap by operating exclusively on onboard depth, historical proprioception, and a decoupled user command interface. VAIC employs a two-stage distillation paradigm. First, a privileged teacher policy masters diverse interaction skills using precise object kinematics and exact environmental states. Second, a deployable student policy distills these capabilities by replacing full body tracking with velocity targets across multiple axes and an interaction indicator for each frame. The student utilizes a recurrent object adaptation module to implicitly infer unobservable object dynamics from raw depth streams and proprioception. Evaluations and real-world deployments on the humanoid robot demonstrate that a single VAIC policy successfully executes highly diverse dynamic tasks. These tasks include box carrying, cart interaction, and skateboarding, consistently outperforming baselines and advancing autonomous humanoid deployment.
comment: Webpage: https://vaic-humanoid.github.io/
VGP-Nav: Metric-Aware Visual Geometric Perception for Robot Navigation
Reliable robotic navigation necessitates the seamless integration of accurate global localization and dense, metric-consistent obstacle perception. A common strategy to achieve these capabilities involves integrating diverse sensing modalities: cameras offer rich visual features for localization, while active sensors like LiDAR provide direct metric measurements. However, such multi-sensor configurations necessitate complex spatial-temporal calibration and increase deployment overhead. Although vision-only approaches offer a low-cost and scalable alternative, existing monocular visual systems typically struggle to simultaneously achieve efficient, globally consistent localization and dense, metric-consistent geometric perception. To bridge this gap, we propose \textbf{VGP-Nav}, a unified framework for \textit{Metric-Aware Visual Geometric Perception} that relies solely on monocular RGB input to jointly support metric localization and obstacle perception. Our key insight is to anchor localization-grounded visual geometry to physically meaningful scale constraints derived from ground-plane geometry, thereby providing a reliable metric reference for monocular perception. VGP-Nav resolves monocular scale ambiguity online and produces localization-grounded, metric obstacle representations that are directly applicable to downstream planning. Extensive experiments demonstrate strong generalization across diverse environments and successful deployment on real mobile robots, highlighting the practicality of our approach for scalable, low-cost, and safe autonomous navigation.
Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection
Vision-language-action (VLA) policies can deviate from nominal trajectories during manipulation, even when tasks remain physically feasible. Recovering from these deviations is challenging, as they push the policy into unfamiliar state spaces where direct re-planning frequently destabilizes action sequences. We propose Back to the Familiar Future (B2FF), a recovery framework for foresight-driven VLAs that leverages future visual conditioning as a recovery interface. Before execution, the VLA generates a milestone bank of familiar future states conditioned on the clean initial observation. At recovery time, a recoverability-aware selector selects a recovery milestone from this bank and enforces it as a fixed visual goal. This enables the VLA to robustly map off-trajectory observations back to a familiar future. On failure-injected LIBERO, under controlled recovery timing aligned with the injected failure, B2FF increases the average success rate of a baseline VLA from 56.3% to 74.0%, demonstrating that pre-imagined milestones can guide recovery without fine-tuning the low-level action generator.
RPO-PDT: Demonstrating Role-Play-Based Knowledge Adaptation for Student Support Dialogue (Demonstration System)
We present RPO-PDT: a retrieval-grounded, role-play-based dialogue system for adaptive student support in higher education. RPO-PDT is: (1) able to provide institution-specific Personal Development Tutor (PDT) guidance using structured knowledge sources; (2) constrained by explicit persona, boundary, confidentiality, and safety policies; and (3) designed around a reverse-roleplay loop where unresolved interactions are replayed from the student perspective, enabling alternative tutor strategies to be generated and stored as reusable strategy memory. RPO-PDT supports both text-based and Furhat-based embodied interaction for demonstrating grounded, safe, and adaptive student-support dialogue.
comment: 5 pages, 2 figures
Can we stabilize an inverted pendulum with feedback from a time-of-flight camera?
Time-of-flight cameras are popular in robotics for providing direct depth information while being compact, inexpensive, and robust to lighting conditions, but their low spatial resolution and depth noise are widely believed to preclude precise feedback control. In this paper, we show that an inexpensive, low-resolution time-of-flight camera provides sufficient feedback to reliably and precisely balance an inverted pendulum on a cart--a canonical benchmark for fast, unstable dynamics.
Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation ICRA 2026
Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity due to the need to manage balance and lean angle, in addition to more reactive steering and throttle control, and a smaller weight. In this work, we present a framework for training an autonomous agent to race a superbike in VRider SBK, a physics-accurate Unity-based motorbike simulator. Our approach integrates Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks based on the agent's performance, without requiring manual curriculum design. The agent's state space comprises proprioceptive features extended with lean-angle history, along with global track features via course points. The reward signal is shaped to encourage progress along the track while penalizing instability-inducing behaviors specific to two-wheeled dynamics. Preliminary experimental results demonstrate that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability across multiple tracks and motorbike models, establishing a first baseline for RL-based autonomous motorbike racing.
comment: Presented at the "1st Workshop on Generalization in Autonomous Driving: Paradigms, Practice, and Public Road Demonstrations" at ICRA 2026, Vienna. Oral+poster presentation
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation
World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on tabletop manipulation, but iterative denoising over high-dimensional video-action latents leaves them too slow for real-time humanoid loco-manipulation. The problem is compounded by the dominant hierarchical paradigm, in which a high-level manipulation policy controls only the upper body while a low-level controller tracks coarse base commands -- placing upper and lower body in inconsistent action spaces and reducing the legs to balance-preserving locomotion. We present MotionWAM, a real-time WAM that drives autonomous humanoid loco-manipulation from a single egocentric camera by conditioning the policy on the intermediate denoising features of a video world model. MotionWAM replaces the upper-lower split with a unified motion latent and predicts whole-body motion tokens that jointly cover locomotion, torso motion, height regulation, foot interaction, and hand manipulation in a single action space. A three-stage learning framework progressively adapts the video world model to egocentric visual dynamics and to the target humanoid embodiment. On nine real-world Unitree G1 tasks, MotionWAM runs in real time, substantially outperforms Vision-Language-Action (VLA) baselines fine-tuned on the same demonstrations by over 30% in overall success rate, and executes task-driven foot interaction that decoupled upper-lower policies cannot reach. Our results suggest that video-pretrained WAMs can be lifted from tabletop manipulation to coordinated, human-like whole-body humanoid control.
Deterministic Execution of ROS~2 Applications via Lingua Franca
The Robot Operating System~2 (ROS 2) is a widely used middleware for robotic systems, characterized by a publish-subscribe (pub-sub) communication mechanism in which computation is structured as callbacks dispatched by ROS 2 executors. Despite its popularity, the pub-sub pattern in ROS 2 is inherently nondeterministic: the order in which these callbacks run is nondeterministic even within a single executor, and distributed deployments add further nondeterminism from the interleaving of messages across nodes and from network latency. Such nondeterminism often leads to concurrency issues and makes it virtually impossible to analyze for safeness and provide guarantees. We present a framework that is able to convert an unmodified ROS 2 application and run it under Lingua Franca (LF), a coordination language for deterministic execution using logical time, so that the same input always produces the same deterministic execution order. We first describe which ROS 2 features can be executed deterministically under logical time. Such features enable the possibility to establish an automatic conversion framework to extract information from a ROS 2 application and directly convert it into an LF program. The rich features of LF, such as logical-time delays, federated execution across processes, and fault handling, can then be applied to make the ROS 2 application be executed in a deterministic and timing-predictable manner without changing the ROS 2 code. We evaluate the framework on a synthetic example and on the Autoware reference system. We show that the order in which callbacks are executed differs in default ROS 2, while also having end-to-end latencies that vary across executions. In contrast, our LF-controlled ROS 2 system produces a deterministic execution order and consistent end-to-end latencies.
Trajectory Optimization in Single and Dual-UAV Bearing-Only Target Localization
Bearing-only target localization is a fundamental problem in optical measurement and finds extensive applications in unmanned aerial vehicle (UAV) technology. Effective trajectory planning establishes favorable observation geometries, thereby enhancing the target localization accuracy of bearing-only UAV systems. This paper proposes an trajectory optimization method for unmanned aerial vehicles (UAVs) in bearing-only target localization scenarios. By leveraging the Fisher Information Matrix (FIM), the proposed approach dynamically integrates the geometric configuration and vehicle maneuverability into the optimization framework. Specifically, we introduce a spectrally-weighted FIM objective function that provides better gradient dynamics near degenerate configurations, enabling the planner to rapidly escape from poor observation conditions. For dual-UAV scenarios, an intersection angle sine term is introduced to optimize triangulation geometry by improving the sight-line intersection angle, thereby preventing trajectory aggregation. Furthermore, we propose an improved Particle Swarm Optimization (PSO) algorithm with motion model constraints and particle normalization to ensure the physical feasibility of the trajectory and enhance the compatibility with the objective functions. Simulation results demonstrate that the proposed method reduces the median localization error by 99.21% compared to conventional FIM-based approaches in single-UAV scenarios, and achieves a 69.70% improvement for dual-UAV configurations, exhibits superior performance in long-duration bearing-only target localization of maneuverability targets at extended ranges.
comment: 16 pages, 13 figures and 6 tables. Submitted to Measurement
Autonomous Obstacle Removal for Excavators through Policy Learning with Particle Simulation
Autonomous obstacle removal from the ground is an important earthwork task, but this is difficult to automate because an excavator must adapt its excavation trajectories over repeated cycles as soil-obstacle conditions change. Learning such state-dependent behavior requires a training environment that reproduces accumulated soil-obstacle interactions, including contact states, terrain deformation, and obstacle visibility. Accordingly, particle-based simulation is suitable for the relevant policy learning. However, particle simulation is computationally expensive, and repeated excavation cycles further increase the learning cost. We observe that the burial condition of an obstacle governs both task difficulty and simulation cost: deeper burial makes obstacle removal harder while also requiring more particles for accurate simulation. This observation motivates a burial-conditioned curriculum learning strategy. We propose a time-efficient sim-to-real policy learning framework in which the policy observes terrain and obstacle information from RGB-D measurements and then outputs a parameterized excavation trajectory; in this process, the simulator reproduces in a real-world excavator the same observation-action interface it uses under controllable burial conditions. The curriculum begins with shallow burial conditions and progressively increases burial depth while adjusting particle count, thus simultaneously controlling task difficulty and simulation cost. Experiments show that the proposed framework successfully learns an effective obstacle-removal policy, whereas baseline methods fail even after a full week of training. The proposed curriculum achieves effective performance within three days and achieves successful transfer to a real 12-ton excavator operating on open ground with various steel obstacles, thus demonstrating robust obstacle removal.
comment: under review
Bridged SBI: Correcting Biased Low-Fidelity Posteriors for Cost-Efficient High-Fidelity Inference
Accurate calibration of particle-based simulators is crucial for robotic earthwork simulation, but analytical calibration is challenging due to this task's highly nonlinear particle dynamics and the black-box nature of conventional simulators. Although simulation-based inference (SBI) can estimate posterior distributions over simulation parameters solely from forward simulations, applying SBI directly to high-fidelity (HF) particle simulators is often computationally prohibitive. Low-fidelity (LF) simulators with coarser particles can reduce this cost, but changes in particle size and particle count shift the parameter values needed to reproduce the same observation, producing biased LF posteriors. We propose Bridged SBI, which leverages a biased but informative LF posterior to guide HF inference. This method first uses inexpensive LF simulations to identify a coarse high-density parameter region, and then it learns a local residual bridge to transport LF posterior samples toward HF-consistent regions by correcting the LF--HF discrepancy. We analyze how sequential multi-fidelity SBI (Naive-MF) can suffer from LF-induced posterior miscoverage when it directly relies on the LF posterior without discrepancy correction. We then show that Bridged SBI is designed to alleviate this issue by explicitly modeling the LF--HF discrepancy through residual correction. Experiments on both sim-to-sim particle-parameter calibration and real-to-sim calibration with real soil observation show that Bridged SBI produces more accurate and reliable HF posteriors than HF-only SBI or the Naive-MF baseline, especially under limited HF simulation costs.
From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs ICRA 2026
Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.
comment: Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026
RAM: Reachability Across Morphologies
Many stages of the robotic lifecycle, from morphology synthesis to operation, rely fundamentally on the reachable workspace. However, current methods for approximating workspaces are slow, imprecise, or tied to a single morphology. We introduce Reachability Across Morphologies (RAM): a morphology-conditioned, implicit neural representation that acts as a fast, differentiable surrogate for pose reachability, generalising to unseen morphologies while inherently accounting for self-collisions. To train RAM, we publish a large-scale dataset of $3\cdot10^{10}$ samples generated solely from forward kinematics. Experiments show that our model achieves an $ F_1$-score of $86\%$ at nanosecond inference, outperforming the baseline by $14\%$ while reducing inference time by three orders of magnitude. We further demonstrate speed-ups of one and two orders of magnitude for gradient-based morphology and trajectory optimisation, respectively. Website: https://timwalter.github.io/ram.
comment: 22 pages, 11 figures
LAEI: Layered Autonomous Edge Intelligence Framework for Robust UAV Swarm Operations
Autonomous UAV swarms require scalable coordination mechanisms that maintain mission performance under limited communication, environmental uncertainty, and component failures. Centralized approaches provide global coordination but suffer from communication bottlenecks and single-node vulnerabilities, whereas fully decentralized methods often lack mission-level consistency. This paper presents Layered Autonomous Edge Intelligence (LAEI), a UAV-swarm framework that combines onboard learned policies with lightweight mission-level supervision. Each UAV performs local perception, obstacle avoidance, and action selection onboard, while the supervisory layer provides adaptive goal reassignment, fault-aware recovery, and context-dependent policy guidance without directly controlling low-level actions. LAEI further incorporates recovery strategies, including dynamic reassociation, backup supervisory support, and fallback local autonomy, to maintain mission continuity under representative failure scenarios. We evaluate LAEI in simulated UAV-swarm scenarios using mission completion time, collision rate, and coverage efficiency. The results show that LAEI reduces mission completion time and improves operational efficiency while maintaining collision-aware distributed UAV-level decision-making.
comment: Preprint. Submitted to arXiv
Autonomous FPV Flight with Translational Optical Flow and Uncertainty Mask
Autonomous FPV quadrotor flight in complex environments using a monocular RGB camera as the sole exteroceptive sensor remains a fundamental challenge. Recent research has shown that using optical flow as the input of a neural network can achieve end-to-end autonomous flight in cluttered scenes. However, extracting the most relevant information from the flow estimation is the key bottleneck limiting agility and robustness. Existing methods struggle to disentangle obstacle-induced optical flow from the ego-motion background flow and suffer from low signal-to-noise ratios near the focus of expansion (FoE). To address these issues, we decompose the optical flow into translational and rotational components and utilize only the translational flow, which captures scene geometry and depth cues. In addition, we introduce an uncertainty mask derived from inconsistencies between forward and backward flow estimates. This mask highlights obstacle structures, including those within the FoE region. Both cues are fed to a control policy trained in a differentiable simulation framework, which enables efficient first-order optimization across perception and control. We validate our approach through extensive experiments in both simulated and real-world forest environments. The proposed system achieves robust flight at speeds of up to 13.91 m/s in simulation and 11.79 m/s in real-world tests, with a 93.3\% success rate over 30 real-world trials, nearly doubling the previously reported 6 m/s real-world speed of the monocular-RGB optical-flow UAV obstacle avoidance system.
ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models
Latent world models are increasingly used for control and goal-conditioned planning, yet assessing whether their learned representations are useful for planning usually requires slow, planner-coupled simulator evaluation with CEM or similar planners. Such evaluation is black-box and model-complexity-dependent: under the same protocol, different world models may require minutes to hours per checkpoint. In this work, we propose ATM, an Action-Consistency Transfer Matrix for diagnosing whether latent transitions preserve action semantics relevant to planning. ATM compares action information in real encoded transitions and model-predicted transitions through lightweight post-hoc probes, producing an interpretable matrix that reveals representation quality, transition-domain inconsistency, and failure modes without simulator rollout. It can also be collapsed into a simple screening score for within-task ranking across checkpoints, variants, and world models. When the true success gap is non-trivial, ATM achieves highly reliable pairwise ranking, while reducing minutes-to-hours CEM evaluation to seconds-level transition analysis, yielding more than 100x speedup in our setup. We further introduce AITS, showing that action-identifiability is not only diagnostic but also a useful training signal for improving downstream planning without changing the planner.
comment: 13 pages, 3 figures, 6 tables
SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning
Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space--landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.
comment: 23 pages, 9 figures, 7 tables
C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache
World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk's denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a $2.5\times$ speedup in total wall-clock inference time, with negligible degradation in task success rate.
PTDL:Multi-Terrain Fall Recovery via Phase-Terrain Decoupled Learning
Humanoid robots can fall on slopes, gravel, and uneven ground in unstructured environments. We target integrated fall recovery and locomotion: rebuilding balance from a fallen state using proprioception alone and resuming velocity-commanded walking at the fall site. Prior methods often stop at quasi-static rise, neglect the post-fall ground-contact phase, or, when trained on mixed terrains without separating recovery and locomotion phases or per-surface constraints, collapse to a single compromise get-up across surfaces. We propose Phase--Terrain Decoupled Learning (PTDL), which decouples training supervision along phase and terrain axes while deploying one proprioceptive policy. On the phase axis, projected-gravity-gated dual motion-prior discriminators and a probe-to-walk transition link post-fall recovery to commanded walking. On the terrain axis, terrain-stratified recovery shaping assigns surface-specific training supervision on flat ground, gravel, and slopes; terrain labels are training-only and withheld from policy observations, enabling implicit post-fall strategy selection at deployment. We validate PTDL on a 29-DoF Unitree G1 across flat ground, gravel, and slopes up to 20 degrees in simulation and on hardware, achieving stable cross-terrain recovery, smooth recovery-to-locomotion transitions, and differentiated post-fall rise behaviors under one deployed policy.
YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale
We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion. Using the YUBI devices, we set up a data collection system with integrated VR-based 6 DoF tracking of the gripper, ensuring high-fidelity trajectory data acquisition. We curate a UMI-based dataset of unprecedented scale: 8,434 hours across 1.20M episodes and 119 tasks. Experiments show that YUBI offers advantages over the UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. A single policy trained on the YUBI dataset transfers across multiple bimanual robots (UR, Franka, and ELEY) simply by mounting the gripper on each platform, confirming that the collected data are directly executable as policy supervision. We release the gripper hardware, data-collection software, and dataset as one integrated stack, offering the open community a reproducible path to large-scale data acquisition for advancing robotic foundation models.
comment: Project page: https://yubi.airoa.io/
What Demonstration Curation Metrics Do to Your Policy
We study whether demonstration-curation metrics that detect defective training episodes also improve the downstream behavior-cloning policy that trains on the curated data. On a contact-rich LIBERO pick-and-place benchmark with a controlled structural defect (early gripper release during the carry phase), we find that the two quantities are sharply decoupled. The metric with the highest defect-detection AUROC (0.804) produces the worst curated policy (13.3% task success), while a metric with a substantially lower AUROC (0.638) produces a policy that nearly matches the oracle trained on ground-truth clean data (90.0% vs. 93.3%). We further show that five of the seven metrics we evaluate exploit episode length as a trivial proxy for the defect label, a confound that inflates reported AUROCs to near-perfect values and disappears once episode length is controlled. Across all conditions, the contaminated baseline succeeds on only 3.3% of rollouts, and the two best curation methods close this to within 3 percentage points of the 93.3% oracle ceiling. Our results argue that curation methods should be evaluated by the policy they produce, not the defects they flag, and that any curation benchmark must control for episode length before reporting detection accuracy. We release the testbed, all metric implementations, and the evaluation pipeline.
comment: 6 pages, 1 figure, 2 tables
SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration ICLR 2026
Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor's epistemic uncertainty. Analytically we show that this adjustment implicitly reweighs policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.
comment: ICLR 2026
Exploration of Foundation Model-Based Robots in Patient and Elderly Care
Demand for older-adult and patient care is growing rapidly as populations age worldwide. Foundation models are increasingly being integrated into robots and interactive agents, with the promise of more flexible communication and personalized assistance. However, care settings require reliable and workflow-compatible systems with accountable human oversight, and it remains unclear whether current embodied systems can translate technical advances into clinical impact. This Perspective synthesizes foundation model-based care robots across three areas: design features, user experience, and evidence for care-related outcomes. Current systems most commonly use foundation models as conversational and reasoning layers within voice-centered socially assistive embodiments, while multimodal grounding and physical autonomy remain limited. Empirical evaluations report positive usability and engagement benefits, but reliability failures persist across the interaction pipeline such as hallucinations and conversational breakdowns. Evidence for care impact remains concentrated in proximal outcomes such as cognitive engagement and participation, with limited evidence for validated clinical or care-related changes. We argue that future research should transition toward care-specific evaluation standards, accountable autonomy, and integration into care workflows to support more responsive and responsible care technologies.
Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs
We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through generic inputs, such as a keyboard. This method can be used out-of-the-box and does not require retraining or fine-tuning VLAs. It enables relatively crude user inputs to steer a VLA to align with user intent. The VLA transforms these inputs into action samples drawn from the VLA expert action distribution learned during training, so that the generated actions are high quality (conformity to the action expert distribution) and high fidelity (reflecting the user's intent). We demonstrate that flow control has many desirable properties: (1) flow control accurately and responsively steers robot actions with user inputs, (2) it is robust to suboptimal user inputs, (3) it enables users to steer VLAs to achieve significantly higher success rates and faster task completion, and (4) fine-tuning a VLA on flow control trajectories improves the autonomous policy. Together, these results provide a simple and intuitive way for users to help steer VLA actions, increasing task performance.
comment: 10 pages, 5 figures
Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination
World-Action Models (WAMs) have emerged as a promising paradigm for embodied control by coupling future visual prediction with action generation. However, most existing WAMs rely on photorealistic future prediction, which incurs high inference latency and makes real-time robot deployment difficult. This motivates a more efficient WAM design that preserves the control benefits of future visual prediction while reducing its inference cost. We introduce Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit. Efficient-WAM improves inference efficiency via a compact video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Instead of optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, our 1B-parameter model can reduce per-chunk latency to around 100 ms during physical deployment, achieving a 30x speedup over existing WAMs.
Robotic Nonprehensile Object Transportation with a Hanging Tray
We consider the nonprehensile object transportation task known as the waiter's problem, in which a robot must move an object balanced on a tray from one location to another. In contrast to prior works on the robotic waiter's problem, which make the robot tilt a tray rigidly held by its end effector (EE), we use a tray suspended from the EE by ropes, such that it behaves like a three-dimensional pendulum. Some prior works have actuated the robot so that the EE simulates the behavior of a pendulum, because pendular motion reduces the shear forces acting on the transported objects, minimizing the sliding of rigid objects and sloshing in containers of liquid. In contrast, our use of a real hanging tray allows us to obtain the benefits of pendular motion while only actuating a 3 degree-of-freedom (DOF) mobile base, rather than requiring a full 6-DOF manipulator arm. Our experiments in simulation and on real hardware show that the hanging tray substantially reduces both sliding and sloshing compared to a static, rigidly-grasped tray. Furthermore, we integrate the hanging tray into an interactive robot waiter demonstration, which uses computer vision to identify people with a raised hand and visual servoing to steer toward them and allow them to access the tray.
comment: 8 pages, 11 figures. IEEE/ASME International Conference on Advanced Intelligent Mechatronics, 2026
GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation
We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
comment: Accepted at RSS 2026
Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization
We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of $>55\%$ in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.
comment: 16 pages, 12 figures
Uncertainty-Aware Motion Planning for Autonomous Driving in Mixed Traffic Environment
In mixed-traffic environments where autonomous and human-driven vehicles may co-exist, motion planning for autonomous vehicles requires anticipating the future behaviors of surrounding human drivers. Existing reinforcement learning-based methods generally directly incorporate the predicted human intents into the observation to enable a proactive planning. However, human intent is inherently uncertain due to the behavioral diversity, perception noise, and partial observability. Treating predicted intends as deterministic states can result in unsafe decisions for autonomous vehicles. To address this problem, we propose Uncertainty-Aware Motion Planning (UAMP), which incorporates uncertainty in human intent prediction for AV decision-making. Specifically, UAMP first introduces a proximity-aware uncertainty estimator to quantify the interaction-conditioned intent uncertainty and constructs an uncertainty-guided joint intent distribution over surrounding human-driven vehicles. Within this uncertainty set, UAMP further introduces Uncertainty-Calibrated Value Learning (UCVL) to correct value function learning biases arising from directly incorporating uncertain human intent predictions into the observation. Extensive experiments in various mixed-traffic scenarios show that UAMP significantly improves safety and driving comfort, while maintaining traffic efficiency compared with existing approaches. The code is released at https://anonymous.4open.science/r/UAMP-5638.
DIJIT: A Robotic Head for an Active Observer
We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT's unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT's design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. DIJIT attains 85\% of the peak human saccade speed. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. Here, we present DIJIT and some aspects of its performance. We also present a novel method for saccadic camera movements, using a direct relationship between camera orientation and motor values. The resulting saccadic camera movements are close to human movements in terms of their accuracy, with 1.17$^\circ$ and 1.14$^\circ$ mean error for the left and right cameras, respectively.
Continuous Reasoning for Vision-Language-Action
Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over π0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action.
comment: Project page: https://continuous-reasoning.airoa.io
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs
Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.
comment: Project page: https://s2.airoa.io
Programmable Deformation Design of Porous Soft Actuator through Volumetric-Pattern-Induced Anisotropy ICRA 2026
Conventional soft pneumatic actuators, typically based on hollow elastomeric chambers, often suffer from small structural support and require costly geometry-specific redesigns for multimodal functionality. Porous materials such as foam, filled into chambers, can provide structural stability for the actuators. However, methods to achieve programmable deformation by tailoring the porous body itself remain underexplored. In this paper, a novel design method is presented to realize soft porous actuators with programmable deformation by incising specific patterns into the porous foam body. This approach introduces localized structural anisotropy of the foam guiding the material's deformation under a global vacuum input. Furthermore, three fundamental patterns on a cylindrical foam substrate are discussed: transverse for bending, longitudinal for tilting, and diagonal for twisting. A computational model is built with Finite Element Analysis (FEA), to investigate the mechanism of the incision-patterning method. Experiments demonstrate that with a potential optimal design of the pattern array number N, actuators can achieve bending up to $80^{\circ}$ (N=2), tilting of $18^{\circ}$ (N=1), and twisting of $115^{\circ}$ (N=8). The versatility of our approach is demonstrated via pattern transferability, scalability, and mold-less rapid prototyping of complex designs. As a comprehensive application, we translate the human hand crease map into a functional incision pattern, creating a bio-inspired soft robot hand capable of human-like adaptive grasping. Our work provides a new, efficient, and scalable paradigm for the design of multi-functional soft porous robots.
comment: Accepted to 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots
Strawberry-harvesting robots faced challenges such as poor visual perception, gripper misalignment, empty grasp/misgrasp, and slippage, which reduced harvesting stability and efficiency.To overcome these issues, this paper proposes a visual fault diagnosis and self-recovery framework. An end-to-end SRR-Net achieved unified perception and fault diagnosis through joint detection, segmentation, and ripeness regression of the fruit and gripper. Leveraging this integrated perception, a relative error compensation method driven by simultaneous target-gripper detection was designed to correct positional misalignments exceeding the tolerance threshold. A micro-optical camera integrated within the end-effector delivered real-time visual feedback. Based on the micro-optical camera, a MobileNet V3-Small classifier was utilized for grasp adjustment during the deflating stage, enabling the early abort of the harvesting cycle in cases of empty grasp/misgrasps. Furthermore, a time-series LSTM classifier was applied during the snap-off stage to predict strawberry slippage. Based on these predictions, the system executed re-inflation and a secondary snap-off attempt for slipping strawberries, or aborted the cycle for slipped strawberries. Experiments demonstrated that the mean absolute errors between the end-effector and the picking point were reduced to 3.12 mm and 4.06 mm from 11.50 mm and 5.25 mm along the x- and y-axes, respectively, at the cost of a time increment of 0.64 $pm$ 0.24 s. The grasp adjustment module reduced the grasping phase by approximately 0.5 s and avoided empty-placement for failure cases. The strawberry slip prediction module handled slipped cases with an 88.89% success rate, saving approximately 4.00 s per harvesting cycle for failure cases. Also, it achieved an 81.25% recovery rate for slipping strawberries, requiring additional 0.63 s for re-grasping.
comment: Accepted by Artificial Intelligence in Agriculture
Symskill: Symbol and Skill Co-Invention for Data-Efficient and Reactive Long-Horizon Manipulation ICRA 2026
Multi-step manipulation in dynamic environments remains challenging. Imitation learning (IL) is reactive but lacks compositional generalization, since monolithic policies do not decide which skill to reuse when scenes change. Classical task-and-motion planning (TAMP) offers compositionality, but its high planning latency prevents real-time failure recovery. We introduce SymSkill, a unified framework that jointly learns predicates, operators, and skills from unlabeled, unsegmented demonstrations, combining compositional generalization with real-time recovery. Offline, SymSkill learns symbolic abstractions and goal-oriented skills directly from demonstrations. Online, given a conjunction of learned predicates, it uses a symbolic planner to compose and reorder skills to achieve symbolic goals while recovering from failures at both the motion and symbolic levels in real time. Coupled with a compliant controller, SymSkill supports safe execution under human and environmental disturbances. In RoboCasa simulation, SymSkill executes 12 single-step tasks with 85% success and composes them into multi-step plans without additional data. On a real Franka robot, it learns from 5 minutes of play data and performs 12-step tasks from goal specifications. Code and additional analysis are available at https://symskill.github.io/ .
comment: ICRA 2026 Best Conference Paper Award; ICRA 2026 Best Paper Award on Planning and Control; CoRL 2025 Best Paper Award on Learning Effective Abstractions for Planning (LEAP) Workshop (https://symskill.github.io/)
Distant Object Localisation from Noisy Image Segmentation Sequences
3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with specialised sensor configurations or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved with either multi-view triangulation or particle filters, with the latter also providing shape and uncertainty estimates. We studied the solutions using 3D simulation and drone-based image segmentation sequences with global navigation satellite system (GNSS) based camera pose estimates. The results suggest that combining the proposed methods with pre-existing image segmentation models and drone-carried computational resources yields a reliable system for drone-based wildfire monitoring. The proposed solutions are independent of the detection method, also enabling quick adaptation to similar tasks. Code is available at https://fgi_nls.gitlab.io/public/distant-localisation
Robot-DIFT: Correspondence-Sensitive Diffusion Features for Contact-Rich Robot Manipulation
Robot manipulation often fails in the final millimeters: a policy may recognize the right object yet miss the pose offsets, boundaries, or pre-contact alignments needed for action. We argue that such failures arise when semantic invariance suppresses correspondence cues for closed-loop control, or when these cues are not exposed to the policy in a usable form. Modern visual encoders provide strong semantic abstractions, but contact-rich manipulation requires correspondence sensitivity: discriminative feature responses to action-relevant changes in pose, boundary, and contact geometry. Diffusion features provide a strong prior for dense correspondence, but direct use is impractical due to stochasticity, latency, and representation drift. We introduce Robot-DIFT, a deterministic diffusion-derived backbone for real-time control. Through Manifold Distillation, Robot-DIFT converts a noise-conditioned diffusion Teacher into a clean-input, single-pass Student while preserving the teacher's feature manifold. A Spatial--Semantic Feature Pyramid Network (S2-FPN) fuses coarse-to-fine Student decoder features into visual tokens that expose semantic context and fine contact detail to the policy. Across RoboCasa, LIBERO-10, and real robots, Robot-DIFT outperforms vision--language, self-supervised, geometry-oriented, and diffusion baselines on contact-sensitive tasks. Controlled backbone/readout swaps show that S2-FPN unlocks, rather than replaces, the diffusion correspondence prior.
SODA-CitrON: Static Object Data Association by Clustering Multi-Modal Sensor Detections Online
The online fusion and tracking of static objects from heterogeneous sensor detections is a fundamental problem in robotics, autonomous systems, and environmental mapping. Although classical data association approaches such as JPDA are well suited for dynamic targets, they are less effective for static objects observed intermittently and with heterogeneous uncertainties, where motion models provide minimal discriminative power with respect to clutter. In this paper, we propose a novel method for static object data association by clustering multi-modal sensor detections online (SODA-CitrON), while simultaneously estimating positions and maintaining persistent tracks for an unknown number of objects. The proposed unsupervised machine learning approach operates in a fully online manner and handles temporally uncorrelated and multi-sensor measurements. Additionally, it has a worst-case loglinear complexity in the number of sensor detections while providing full output explainability. We evaluate the proposed approach in different Monte Carlo simulation scenarios and compare it against state-of-the-art methods, including POM-based filtering, DBSTREAM clustering, and JPDA. The results demonstrate that SODA-CitrON consistently outperforms the compared methods in terms of F1 score, position RMSE, MOTP, and MOTA in the static object mapping scenarios studied.
comment: 8 pages, 5 figures; \c{opyright} 2026 IEEE. Accepted for the 2026 International Conference on Information Fusion (FUSION 2026)
HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers
For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse loco-manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.
comment: 22 pages, 9 figures
IMAC-AgriVLN: Can Agricultural Vision-and-Language Navigation Agents be Aware of Instruction Mistakes?
Agricultural robots are serving as powerful assistants across a wide range of agricultural tasks, nevertheless, still heavily relying on manual operations or railway systems for movement. The AgriVLN method and the A2A benchmark pioneeringly extended Vision-and-Language Navigation (VLN) to the agricultural domain, enabling a robot to navigate to a target position following a natural language instruction. However, almost all the prior methods adopt an ideal assumption that the given instructions themselves are correct, which does not align with the realistic scenarios, because anybody may say an instruction with mistakes. To bridge this gap, we propose the A2A-MI benchmark, in which we build a semi-automatic data annotator to insert three mistake classifications into each original instruction in a more diversified and efficient way. We test several state-of-the-art agricultural VLN agents on it and observe a sufficient drop with -57% on SR and -9% on NE, from which we suggest that an agricultural VLN agent tends to assume that the given instruction is correct, so does not have the awareness to doubt it when the scenes it sees do not align with the instruction it receives. To build the awareness on instruction mistake, we propose the IMAC module analyzing the instruction and the current front-facing image, to judge whether the instruction has mistakes and attempt to correct it when needed. We integrate IMAC into the baseline model, and observe a noteworthy improvement, sufficiently narrowing the gap to the performance on instructions without mistakes. Project: https://github.com/AlexTraveling/IMAC-AgriVLN.
QuadVerse: An Integrated Framework Aligning Visual-Physical Reality for Quadruped Simulation
Simulation is central to robot learning, yet the sim-to-real gap remains a major bottleneck. Existing approaches often tackle visual or dynamic gaps separately, overlooking how these individual mismatches accumulate and propagate throughout the robot's state evolution. In this paper, we introduce QuadVerse, an integrated framework that uses reconstructed scenes as a calibration substrate for aligning visual perception, physical interaction, and actuator dynamics. From captured RGB videos, we reconstruct geometry-constrained 3D Gaussian Splatting (3DGS) scenes that support batched photorealistic ego-view rendering and collision-ready semantic mesh extraction. The meshes further enable contact calibration by initializing spatially varying friction priors and refining them through trajectory-based posterior search. To address remaining actuator discrepancies, QuadVerse trains a residual dynamics compensator by replaying real-world trajectories on the contact-calibrated terrain, reducing the entanglement between terrain-induced contact errors and actuator non-idealities. Experiments show that QuadVerse improves reconstruction quality and locomotion tracking over relevant baselines. Leveraging this foundation, we demonstrate robust zero-shot visual-navigation policy deployment without task-specific real-world rollouts.
ImagineUAV: Aerial Vision-Language Navigation via World-Action Modeling and Kinodynamic Planning
Vision-language navigation (VLN) for UAVs demands grounding free-form instructions into 6-DoF flight under partial observability. While Vision-Language-Action (VLA) models excel at semantic reasoning, they suffer from brittleness due to geometric inconsistency and dynamics mismatch. To address this, we propose ImagineUAV, an imagination-driven framework leveraging cascaded world-action modeling. Instead of direct regression, ImagineUAV employs a latent video diffusion model to generate instruction-conditioned future observations, explicitly imagining environmental evolution, from which 6-DoF motions are inferred via an action extractor. A kinodynamic planner then refines these estimates into collision-free trajectories. Additionally, a step-distilled inference pipeline ensures real-time execution. With only 1.3B parameters, ImagineUAV outperforms prior VLN and VLA baselines on benchmarks and real-world flights, validating the practicality of imagination-driven aerial navigation.
comment: Video demo: https://www.youtube.com/watch?v=Ng1alP0yhc0
Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves CVPR 2026
Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.
comment: CVPR 2026 Highlight. This version includes the motion retarget process in the appendix
CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models
Generalist robots should be able to understand and follow user instructions. Despite providing a powerful architecture for mapping open-vocabulary language instructions to robot actions, current vision-language-action (VLA) models struggle to follow fine-grained commands. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision-language models to create counterfactual labels. By augmenting existing datasets with these labels, we increase the diversity and granularity of language grounding for robot datasets, ultimately improving the language-following capabilities of VLAs. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting vision-language navigation experiments in 3 different indoor and outdoor environments. Our experiments show that counterfactual relabeling (without additional data collection) significantly improves instruction-following in VLA policies, outperforming state-of-the-art methods and doubling the success rate compared to VLAs trained on unaugmented data. We also evaluate our method for manipulation VLAs and find a similar gain in performance on tasks with distractors.
Goal-oriented Communication for Fast and Robust Robotic Fault Detection and Recovery
Autonomous robotic systems are widely deployed in smart factories and operate in dynamic, uncertain, and human-involved environments that require low-latency and robust fault detection and recovery (FDR). However, existing FDR frameworks exhibit various limitations, such as significant delays in communication and computation, and unreliability in robot motion/trajectory generation, mainly because the communication-computation-control (3C) loop is designed without considering the downstream FDR goal. To address this, we propose a novel Goal-oriented Communication (GoC) framework that jointly designs the 3C loop tailored for fast and robust robotic FDR, with the goal of minimising the FDR time while maximising the robotic task (e.g., workpiece sorting) success rate. For fault detection, our GoC framework innovatively defines and extracts the 3D scene graph (3D-SG) as the semantic representation via our designed representation extractor, and detects faults by monitoring spatial relationship changes in the 3D-SG. For fault recovery, we fine-tune a small language model (SLM) via Low-Rank Adaptation (LoRA) and enhance its reasoning and generalization capabilities via knowledge distillation to generate recovery motions for robots. We also design a lightweight goal-oriented digital twin reconstruction module to refine the recovery motions generated by the SLM when fine-grained robotic control is required, using only task-relevant object contours for digital twin reconstruction. Extensive simulations demonstrate that our GoC framework reduces the FDR time by up to 82.6% and improves the task success rate by up to 76%, compared to the state-of-the-art frameworks that rely on vision language models for fault detection and large language models for fault recovery.
comment: Submit to IEEE for potential publication
Adaptive Artificial Time-Delay Control with Barrier Lyapunov Constraints for Euler-Lagrange Robots
This paper addresses the challenge of simultaneously compensating for state-dependent uncertainties and enforcing time-varying state constraints in Euler-Lagrange systems, a common requirement in robotics that remains underserved by existing control designs. A novel adaptive control framework is developed that combines an artificial time-delay-based uncertainty estimation strategy, also known as time-delay estimation, with a barrier Lyapunov function to enforce constraint-aware control design. Specifically, a state-dependent upper bound on the time-delay estimation approximation error is analytically formulated, and an adaptive law is constructed to estimate its parameters online, enabling real-time state-dependent uncertainty compensation without relying on prior model knowledge. To ensure constraint compliance, the barrier Lyapunov function-based controller enforces time-varying bounds on both position and velocity. The resulting architecture is provably stable via Lyapunov analysis. Experimental results on a five-degree-of-freedom robotic manipulator validate the framework's capability, compared with the state of the art, in maintaining strict adherence to safety-critical constraints under dynamic uncertainties.
Adaptive Sliding Mode Control for Vehicle Platoons with State-Dependent Friction Uncertainty
Multi-robot formation control has various applications in domains such as vehicle troops, platoons, payload transportation, and surveillance. Maintaining formation in a vehicle platoon requires designing a suitable control scheme that can tackle external disturbances and uncertain system parameters while maintaining a predefined safe distance between the robots. A crucial challenge in this context is dealing with the unknown/uncertain friction forces between wheels and the ground, which vary with changes in road surface, wear in tires, and speed of the vehicle. Although state-of-the-art adaptive controllers can handle a priori bounded uncertainties, they struggle with accurately modeling and identifying frictional forces, which are often state-dependent and cannot be a priori bounded. This thesis proposes a new adaptive sliding mode controller for wheeled mobile robot-based vehicle platoons that can handle the unknown and complex behavior of frictional forces without prior knowledge of their parameters and structures. The controller uses the adaptive sliding mode control techniques to regulate the platoon's speed and maintain a predefined inter-robot distance, even in the presence of external disturbances and uncertain system parameters. This approach involves a two-stage process: first, the kinematic controller calculates the desired velocities based on the desired trajectory; and second, the dynamics model generates the commands to achieve the desired motion. By separating the kinematics and dynamics of the robot, this approach can simplify the control problem and allow for more efficient and robust control of the wheeled mobile robot.
Multiagent Systems
FASE: Fast Adaptive Semantic Entropy for Code Quality
Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.
Revisiting mesoscopic traffic flow simulation in SUMO: Limitations, analysis, and an alternative
Mesoscopic traffic flow models combines the merits of both macroscopic and microscopic models by capturing individual vehicle behavior in great detail and remaining the computational efficiency. At the time of this study, the mesoscopic model proposed by Eissfeldt (2004) is used in Simulation of Urban MObility (SUMO). The movement of vehicles is governed by dynamic headways between edges. However, the model does not fully comply with the principle of the Lighthill-Whitham-Richards (LWR) model. Several problems are identified, including the incomplete consideration of queue dynamics and the limited implementation of backward traveling spaces. Two case study scenarios demonstrate that the problems lead to unrealistic onset and recovery pattern of congestion. The magnitude of congestion is generally underestimated with this model. To address these drawbacks, a proper mesoscopic discrete-time implementation of link transmission model, which follows the LWR principle, is proposed. By explicitly incorporating backward traveling spaces to capture queue spillback phenomena, the proposed model provides a more precise representation of congestion dynamics. The link density outputs are consistent with the kinematic wave theory and the microscopic traffic simulation in SUMO, thus verifying its theoretical accuracy.
comment: Presentation at SUMO Conference 2026
Performance Evaluation of Social Learning
Social Learning is a decentralized decision-making paradigm in which spatially dispersed agents collect streaming observations regulated by one of a finite number of models (the hypotheses). The agents are interested in assigning probability scores (the beliefs) to the possible hypotheses. To this end, the agents exchange their beliefs according to a certain communication graph. It has been shown that, under reasonable conditions on the identifiability of the decision model and the network connectivity, each agent ultimately places all the belief mass on the true hypothesis governing the data. However, several questions remain unanswered regarding the evaluation of the social learning performance. One recently adopted performance metric is the rejection rate, i.e., the rate at which the beliefs about the erroneous hypotheses vanish. One contribution of this work is to establish that the rejection rate leads to several paradoxes, which make it unsuitable as a valid performance measure. We then focus on studying the error probability measure. For a binary Gaussian problem, we derive an analytical formula characterizing the ratio between the individual agents' probabilities and the optimal Bayesian probability. The formula shows that this ratio is expressed by the product of two terms quantifying the effect of the network connectivity and the role of the prior information. As a result, an irreducible gap emerges between the decentralized and the centralized error probabilities, which is agent-dependent and does not disappear asymptotically.
comment: This work has been submitted to the IEEE for possible publication
Autonomous Incident Resolution at Hyperscale: An Agentic AI Architecture for Network Operations
Cloud network infrastructure at hyperscale presents unique operational challenges where traditional human-driven incident response cannot keep pace with the volume, velocity, and complexity of failures. This paper presents an agentic AI architecture for autonomous incident resolution in large-scale network operations. Our system employs a multi-agent orchestration framework where specialized AI agents collaborate to detect, diagnose, and remediate network incidents without human intervention. We describe the architectural principles, including hierarchical agent decomposition, skills-based tool invocation via standardized protocols, structured knowledge encoding from operational runbooks, progressive autonomy with safety boundaries, and closed-loop verification. The architecture has been deployed in production at a major cloud provider, demonstrating that agentic AI systems can achieve autonomous resolution rates exceeding 90% for common incident categories while maintaining safety guarantees through layered authorization and rollback mechanisms. We discuss design tradeoffs, failure modes, and lessons learned from operating autonomous AI agents at scale.
comment: 7 pages, 6 figures
A Multi-Agent System for IPMSM Design Optimization via an FEA-AI Hybrid Approach
Interior permanent magnet synchronous motor (IPMSM) design requires balancing conflicting objectives and multi-physics constraints, while modern optimization workflows face three bottlenecks: manual problem setup, high finite element analysis (FEA) cost, and unreliable surrogate-based search in sparse or out-of-distribution regions. To address these limitations, we propose an end-to-end automated IPMSM design optimization framework that integrates retrieval-augmented generation (RAG) for structured problem definition with an uncertainty-aware FEA-AI hybrid optimization pipeline. A Design agent, connected to a motor textbook through RAG, provides domain-knowledge-based options and engineering tips, and compiles an optimization card and a design-of-experiments plan for AI-model training. A Training agent automates electromagnetic FEA, records geometry-validation and solver-failure logs, analyzes failed geometries using ANOVA-based data analysis and LLM reasoning, and invokes a Design Sampling agent to redefine the design space and generate additional samples. An Optimization agent performs GA-based search with uncertainty-driven switching: low-uncertainty candidates are evaluated by AI-surrogate inference, whereas high-uncertainty and reliability-critical Pareto-front or top-K candidates are corrected by high-fidelity FEA and reused for iterative retraining. The framework converts manual, experience-dependent configuration into a reproducible workflow that balances computational cost and prediction reliability. Experimental results under a matched high-fidelity FEA budget show that the proposed hybrid approach achieves better objective performance while maintaining low and further reducible predictive uncertainty, outperforming FEA-only search, which is limited by early budget exhaustion, and AI-only search, which converges to a low-confidence optimum.
comment: 26 pages, 21 figures
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.
Deployment-Time Memorization in Foundation-Model Agents ICML
Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deployment-time function rather than solely a property of model weights. Existing work addresses parametric memorization or audits fixed memory configurations, but does not characterize how memory-design choices jointly shape personalization utility, extraction risk, and deletion fidelity. We study this surface as deployment-time memorization, formulating agent memory as a privacy-utility frontier measured by Personalization Recall (PR) and Adversarial Extraction Rate (AER), and sweeping three memory-design knobs: summarization aggressiveness, retrieval breadth (k), and deletion mode. We further introduce the Forgetting Residue Score (FRS) to quantify whether deleted information remains recoverable from derived memory tiers. On LongMemEval, key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall; critically, once content is compressed away, increasing k no longer restores leakage. The same compression, however, induces a deletion-fidelity failure: raw-only deletion leaves derived summary copies recoverable in approximately 20% of instances, and only full-pipeline purge or tombstone redaction drives worst-tier residue to zero. Together, these results establish that persistent agent memory must be evaluated as a first-class memorization mechanism -- assessed by what it helps agents recall, what it makes extractable, and what it can truly erase.
comment: 4 pages, ICML MemFM 2026 Workshop
MASK: Multi-Agent Semantic K-Scheduling for Risk-Sensitive 6G Robotics
Realizing the vision of 6G connected robotics requires reconciling high-performance collaborative control with the rigid spectral limitations of physical wireless channels. In realistic collaborative sensing scenarios, spectral resources are quantized into finite physical resource blocks or orthogonal subcarriers, rendering simultaneous transmission by all agents infeasible. To address this, we propose Multi-Agent Semantic K-Scheduling (MASK), a control architecture designed to sustain robust, risk-aware coordination under strict instantaneous bandwidth caps. We introduce Arbiter-Assisted Semantic Information Gating (A-SIG), a lightweight coordination mechanism that enforces hard access constraints by scheduling only the top-K agents based on locally computed semantic importance scores. By aggregating these prioritized observations into a compact latent state, a self-supervised global encoder enables a distributional policy to mitigate tail risks despite data sparsity. We evaluate MASK across diverse benchmarks, demonstrating that it matches the performance of communication-unconstrained baselines even when channel access is restricted to a small fraction of the swarm size. Furthermore, the framework exhibits inherent resilience to packet erasures, validating semantic scheduling as a critical enabler for resource-constrained 6G systems.
XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems ICRA
Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision-language models, achieving gains from 0.0493 to 0.3359 in BLEU-1, 0.0863 to 0.2440 in ROUGE-L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.
comment: Accepted at the 2026 International Conference on Robotics and Automation in Industry (ICRAI)
Pathfinders in the Sky: Formal Decision-Making Models for Collaborative Air Traffic Control in Convective Weather
Air traffic can be significantly disrupted by weather. Pathfinder operations involve assigning a designated aircraft to assess whether airspace that was previously impacted by weather can be safely traversed through. Despite relatively routine use in air traffic control, there is little research on the underlying multi-agent decision-making problem. We seek to address this gap herein by formulating decision models to capture the operational dynamics and implications of pathfinders. Specifically, we construct a Markov chain to represent the stochastic transitions between key operational states (e.g., pathfinder selection). We then analyze its steady-state behavior to understand long-term system dynamics. We also propose models to characterize flight-specific acceptance behaviors (based on utility trade-offs) and pathfinder selection strategies (based on sequential offer allocations). We then conduct a worst-case scenario analysis that highlights risks from collective rejection and explores how selfless behavior and uncertainty affect system resilience. Empirical analysis of data from the US Federal Aviation Administration demonstrates the real-world significance of pathfinder operations and informs future model calibration.
Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization ACL
Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization trace. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further escapes local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.
comment: Accepted at ACL SRW 2026
ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning
LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge--operator schemas, preconditions, and constraints--remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs--strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72--100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.
comment: Code Implementation: https://github.com/sbhakim/anneal-agents
Learn to Match: Two-Sided Matching with Temporally Extended Feedback
Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback, that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement-learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information-friction loss, revealing that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms. Please refer to https://sites.google.com/view/learn-to-match/home for the official website and the code link.
CORRECT: COndensed eRror RECognition via knowledge Transfer in multi-agent systems
Multi-agent systems (MAS) are increasingly capable of tackling complex real-world tasks, yet their reliance on inter-agent coordination, tool use, and long-horizon reasoning makes error recognition particularly challenging. Minor errors can propagate across agents, escalating into task failures while producing long, intertwined execution trajectories that impose significant costs for both human developers and automated systems to debug and analyze. Our key insight is that, despite surface differences in failure trajectories (e.g., logs), MAS errors often recur with similar structural patterns. This paper presents CORRECT, the first lightweight, training-free framework that leverages an online cache of distilled error schemata to recognize and transfer knowledge of failure structures across new requests. This cache-based reuse allows LLMs to perform targeted error localization at inference time, avoiding the need for expensive retraining while adapting to dynamic MAS deployments in subseconds. To support rigorous study in this domain, we also introduce CORRECT-Error, a large-scale dataset of over 2,000 annotated trajectories collected through a novel error-injection pipeline guided by real-world distributions, and further validated through human evaluation to ensure alignment with natural failure patterns. Experiments across seven diverse MAS applications show that CORRECT improves step-level error localization up to 19.8% over existing advances while at near-zero overhead, substantially narrowing the gap between automated and human-level error recognition.
Decentralized Value Systems Agreements AAMAS 2026
One of the biggest challenges of value-based decision-making is dealing with the subjective nature of values. The relative importance of a value for a particular decision varies between individuals, and people may also have different interpretations of what aligning with a value means in a given situation. While members of a society are likely to share a set of principles or values, their value systems--that is, how they interpret these values and the relative importance they give to them--have been found to differ significantly. This work proposes a novel method for aggregating value systems, generating distinct value agreements that accommodate the inherent differences within these systems. Unlike existing work, which focuses on finding a single value agreement, the proposed approach may be more suitable for a realistic and heterogeneous society. In our solution, the agents indicate their value systems and the extent to which they are willing to concede. Then, a set of agreements is found, taking a decentralized optimization approach. Our work has been applied to identify value agreements in two real-world scenarios using data from a Participatory Value Evaluation process and a European Value Survey. These case studies illustrate the different aggregations that can be obtained with our method and compare them with those obtained using existing value system aggregation techniques. In both cases, the results showed a substantial improvement in individual utilities compared to existing alternatives.
comment: Accepted at AAMAS 2026 (Submission 1181)
Systems and Control (EESS)
An Agency-Transferring Model-Free Policy Enhancement Technique
Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.
Optimal Feedback Communication with Information Maximization and Distortion Minimization
We study the problem of optimally sending a real-valued source through multiple uses of a channel with feedback. First, we state a set of conditions that are sufficient for an encoder to achieve maximal mutual information between the source and all the channel outputs. This set of conditions are also necessary when the channel is input-identifiable, a condition widely satisfied by common channel models. More notably, we further study the information maximization-distortion minimization problem, where the mutual information between the source and all channel outputs still needs to be maximized, while at each step, the MMSE of estimating the source from the channel outputs so far also needs to be minimized. We derive a solution to this problem for discrete channels with certain symmetries, e.g. $k$-ary symmetric or $k$-ary erasure channels. We show that for such channels the famous posterior matching scheme, while not necessary for information maximization alone, is sufficient and essentially necessary for achieving both information maximization and distortion minimization. This work also provides a new perspective of regularizing distortion-minimizing feedback communication through information maximization, which enables us to find the optimal solution that otherwise would be intractable.
Motion planning for hundreds of floating robots
Planning collision-free motion for large robot fleets is difficult because collision avoidance induces strong inter-agent coupling that grows rapidly with team size. We consider omnidirectional floating robots on water, where choreographies are specified by sparse keyframes and an interactive tool must generate trajectories within seconds, even when transitions span minutes and thousands of time steps. We propose a scalable pipeline that builds a collision graph from an initialization, decomposes the coupled problem into interaction clusters, and solves clusters independently (and in parallel) with robustness mechanisms for common decomposition pathologies. We validate the approach in simulations up to 500 robots. The synthesized trajectories have also been deployed in two real-world demonstrations, on Lake Zürich with a fleet of 24 Way of Water crafts and at the Time Space Existence 2025 Venice Biennale.
Powering the Future of AI: Navigating the Trade-offs for Europe's Energy Transition and Net-Zero Goals
The rapid expansion of AI globally has led to the proliferation of energy-intensive hyperscale data centres (DCs), making them as a structurally challenging component in power system planning and operation. Using a spatially explicit optimisation model of Europe across 21 AI growth scenarios, we systematically quantify additional demand, capacity requirements, emissions, and operational impacts of DCs. Results indicate that AI could drive 73-723 TWh of extra demand by 2050, risking cumulative emissions overshoots of 67-181 MtCO2 between 2030 and 2050. Our analysis indicates that after 2030, the geography of AI infrastructure will be shaped more by firm power and system flexibility than by the mere abundance of clean energy. In moderate scenarios, AI requires an additional of 200 hours of firm generation, which increases LCOE by 35 EUR/MWh in key hubs. We show that even under the pessimistic scenarios, existing infrastructure would require 70 GW additional capacity, while under managed growth pathways, this expansion could reach 226 GW. We further find DCs workload dynamics strongly shape energy dispatch, system flexibility, and emissions, while improved efficiency significantly reduces capacity needs, and system peaks. While our findings suggest that net-zero targets for 2050 may be achieved, critical emission risks may appear in the intermediate years, and the EU may compromise its carbon-neutral goals unless policies adapt to this accelerating digital transformation.
A Continuification Approach to CAV Control in Mixed Traffic via Variable Speed Limits SC 2026
This paper presents a method for controlling traffic via the use of connected and automated vehicles (CAVs) acting as moving bottlenecks. Current methods for moving bottleneck control use a couple PDE-ODE model, based on the Lighthill-Whitham-Richard (LWR) model, to represent the influence of the CAV. Control of the CAV is normally achieved by designing the control on the ODE which models the speed of the moving bottleneck. The proposed method in this paper instead looks to reduce the computational burden of controlling multiple CAVs by designing the moving bottleneck controller first on the PDE. The original control designed on the PDE is a linear quadratic regulator (LQR) that determines the optimal variable speed limit (VSL) for the entire length of freeway in order to regulate density to a desired setpoint. Then, a continuification approach is utilized to determine the input speed for each CAV. Results show that multiple CAVs can be controlled via this method, with minimal computational burden, and that as the number of CAVs increases the solution approaches the global optimal solution determined by the LQR.
comment: 7 pages, 5 figures. Accepted to IEEE ITSC 2026, Naples, Italy
Guaranteed Fast Implementation of the Split Covariance Intersection Filter: Nested Newton Method Thanks to the Fourth-Order Convexity of w-Optimization
The split covariance intersection filter (Split CIF) is a useful tool for general data fusion and has the potential to be applied in a variety of engineering tasks. The w-optimization problem involved in the Split CIF concerns the performance and implementation efficiency of the Split CIF. It is known that the w-optimization problem enjoys the desirable property of convexity (or more clearly, the second-order convexity in this paper's context). This paper proves that the w-optimization problem further enjoys a more desirable property namely the fourth-order convexity, thanks to which a guaranteed fast implementation of the Split CIF can be realized. The new implementation is coined as the nested Newton method, which is also presented in this paper.
Tracking the Effective Surface Area of Non-Convex Satellites
This paper presents a novel framework to track the effective surface area of non-convex satellites, enabling the use of aerodynamic drag in low Earth orbit for orbital control. The proposed framework enables the satellite to track the effective surface area while simultaneously performing other maneuvers. We introduce this framework through a backstepping control algorithm, and exemplify its advantages with an extension, to simultaneously maximize solar panel exposure. The equilibria of the closed-loop systems are shown to be asymptotically stable, and simulation results confirm the effectiveness of the proposed framework.
comment: 6 pages, 5 Figures
Leveraging Optimal Information-Power Flow for Transmission Switching in AC/MTDC Grids
The emerging AC/multi-terminal DC grids are regarded as a promising solution for accommodating the increasing integration of renewable energy sources. This work proposes an optimization framework to address transmission switching (TS) problems arising in practical operational scenarios, such as maintenance scheduling, contingency management, and fault restoration. Unlike most existing studies, the proposed framework considers the role of communication networks in TS operations and develops an optimal information-power flow (OIPF) model. The OIPF model captures the impact of information flows on circuit breaker actions while incorporating communication-related costs, thereby better reflecting practical operational decision-making processes. To ensure computational tractability, the resulting optimization problem is formulated as a mixed-integer second-order cone programming (MISOCP) model through convex relaxations, polygonal approximations, and Big-M reformulations. Numerical case studies illustrate the applicability of the proposed OIPF model and indicate its potential in supporting transmission switching decisions.
comment: 6 pages
Delayed Functional Observers for Output-Delayed Linear Systems
This paper introduces a novel class of delayed functional observers specifically designed to reconstruct delayed control laws under severe output measurement lags, directly complementing recent literature \cite{trinhnn26, trinhnam26}. By systematically mitigating simultaneous, unequal delays across both the actuator and sensor channels, the proposed architecture resolves dual-channel latency without requiring full-state estimation or computationally intensive real-time distributed integration. Ultimately, this work provides a powerful, low-order framework that bridges the gap between idealized control theory and the practical constraints of modern networked engineering systems.
comment: Short version of a chapter intended for a forthcoming research monograph
Advanced simulation framework for AC/MTDC power systems
Alternating current (AC)/multi-terminal direct current (MTDC) hybrid power systems (HPSs) play a crucial role in enabling long-distance power transmission and flexible interconnections between AC grids. However, the challenges that HPSs encountered are numerous, with stability and harmonic issues being particularly prominent. Traditional electromagnetic transient (EMT) tools have struggled to accommodate small-signal stability problems and the potential issues of the optimal interactions among converters. To address this gap, HARMONY ("HARMONic stabilitY assessment of PE-penetrated power systems") has been developed for the advanced simulation and analysis of interconnected AC/MTDC HPSs as a comprehensive mathematical framework based on C++ programming language. The primary goals of Harmony are to provide faster and trusted stability analyses, and address the analytical difficulties associated with converter control dynamics, converter-driven stability, and interoperability in HPSs. This framework is intended to be open source, therefore broadening collaboration for researchers, and to contribute to the community of power systems engineers. In this paper, we demonstrate two core functionalities featured in HARMONY, that are optimal power flow (OPF) and harmonic stability analyses (HAS). The underlying analysis models and computational methodologies for both functionalities are presented in detail to help future readers and users gain a clear understanding of mathematical fundamentals of HARMONY. Furthermore, we introduce the integrated framework of OPF and HAS designed in HARMONY, along with representative printed analysis results, to demonstrate the appealing capabilities of HARMONY.
comment: 13 pages
Dual Quaternion-Based Unscented Kalman Filter with Visual Inertial Odometry for Navigation in GPS-Denied Environments
Reliable navigation in GPS-denied environments remains a fundamental challenge in robotics, aerospace, and autonomous vehicle applications. This paper presents a Dual Quaternion-Based Unscented Kalman Filter (DQUKF) equipped with a Visual Inertial Odometry (VIO) algorithm for accurate state estimation enabling navigation in GPS denied locations. The proposed framework formulates the DQUKF in an error state manner, where the nominal pose is represented by a unit dual quaternion and the local pose error is represented by a 6-dimensional twistor parameterization used for sigma point generation, covariance propagation, and measurement correction. In parallel, the VIO algorithm tracks features across image frames, synchronizes measurements between the IMU and camera, and provides visual constraints that complement inertial propagation. Simulation results on the EuRoC MAV dataset show that the proposed DQUKF converges under high initialization uncertainty and achieves a position RMSE of 0.2584~m in the difficult flight sequence, outperforming the benchmark filters.
Revisiting mesoscopic traffic flow simulation in SUMO: Limitations, analysis, and an alternative
Mesoscopic traffic flow models combines the merits of both macroscopic and microscopic models by capturing individual vehicle behavior in great detail and remaining the computational efficiency. At the time of this study, the mesoscopic model proposed by Eissfeldt (2004) is used in Simulation of Urban MObility (SUMO). The movement of vehicles is governed by dynamic headways between edges. However, the model does not fully comply with the principle of the Lighthill-Whitham-Richards (LWR) model. Several problems are identified, including the incomplete consideration of queue dynamics and the limited implementation of backward traveling spaces. Two case study scenarios demonstrate that the problems lead to unrealistic onset and recovery pattern of congestion. The magnitude of congestion is generally underestimated with this model. To address these drawbacks, a proper mesoscopic discrete-time implementation of link transmission model, which follows the LWR principle, is proposed. By explicitly incorporating backward traveling spaces to capture queue spillback phenomena, the proposed model provides a more precise representation of congestion dynamics. The link density outputs are consistent with the kinematic wave theory and the microscopic traffic simulation in SUMO, thus verifying its theoretical accuracy.
comment: Presentation at SUMO Conference 2026
Can we stabilize an inverted pendulum with feedback from a time-of-flight camera?
Time-of-flight cameras are popular in robotics for providing direct depth information while being compact, inexpensive, and robust to lighting conditions, but their low spatial resolution and depth noise are widely believed to preclude precise feedback control. In this paper, we show that an inexpensive, low-resolution time-of-flight camera provides sufficient feedback to reliably and precisely balance an inverted pendulum on a cart--a canonical benchmark for fast, unstable dynamics.
Modeling of Spinning Plates: Geometric Stiffening and Modal Approximation for GNC Applications
This work presents a modal formulation for flexible rectangular plates, accounting for nonlinear geometric effects arising from in-plane foreshortening and centrifugal stiffening. The model is linearized with respect to elastic deformations while retaining the full dependence on spacecraft angular velocities and accelerations. System matrices depend nonlinearly on spacecraft states through squared and cross-product terms, capturing gyroscopic coupling and dynamic stiffening phenomena for arbitrary rotational maneuvers. Polynomial approximation of mode shapes enables efficient computation while preserving accuracy. Model predictions are validated against finite element simulations and literature data for transient response under prescribed hub motion.
comment: Conference paper for the 28th AIDAA International Congress and the 10th CEAS Aerospace Europe Conference, held from December 1 to 4 2025. 6 Pages, 4 Figures
Families of Control-Cost-Parametrized Inverse-Optimal Universal Stabilizers
A classical universal stabilization formula offers the practitioner no design freedom: it is a single, parameter-free object. We introduce a cost-parametrized family of stabilizing feedback laws, where (1) the user chooses a function that serves as the running cost on control in an inverse-optimal cost functional, and (2) obtains, through a formula, a nonlinear "expander" of a pre-existing universal controller, which solves an infinite-horizon optimal control problem with a meaningful cost on the state. The cost-to-expander formula is a three-step construction, involving, inter alia, cost differentiation and function inversion-overall, a nonlinear infinite-dimensional operator. The cost-to-expander operator is proven Lipschitz, which enables uniform neural operator approximation of the entire family and supports both offline performance exploration and online adaptation. Semiglobal practical asymptotic stability and second-order suboptimality bounds are established under the approximation. The operator learning and its use in semiglobal stabilization are illustrated numerically. We call the result 'half-direct-optimal' because the paper's design is less than a general 'direct optimal' (HJB-inducing) control, but more than the fully inverse optimal, since the user performs minimization for an arbitrary given cost on control. The dual to the half-direct problem we solve is the problem in which the cost on the state is arbitrary and given. This dual problem is easier and outside of the scope of the paper.
comment: 13 Pages
Seamless Contraction-Control Framework for Unplanned Grid-Connected/Stand-Alone Transitions of Grid-Forming Inverters
Unplanned grid-connected (GC)/stand-alone (SA) transitions commonly occur in AC microgrids during protection trips, manual breaker operation, or low-bandwidth supervisory communication. Under such unplanned transitions, a grid-forming inverter must support the local-load voltage in stand-alone operation and regulate the desired power/current injection in grid-connected operation. Existing P--Q droop-based seamless-transfer methods often rely on planned transition commands, supervisory islanding detection, or pre-synchronization interval, which may prevent timely voltage/current support during unplanned bidirectional transitions. To address this problem, this paper proposes a seamless contraction-control (SCC) framework for target dynamics. Using the SCC, contraction-based grid-connected current-control and stand-alone voltage-control laws are proposed. With the new control laws, the inverter achieves transient stability and converges to the target trajectory with a prescribed convergence rate. Furthermore, a breaker-status observer is proposed to infer the grid-connected/stand-alone mode from voltage measurements on both sides of the breaker, eliminating the need for a dedicated pre-synchronization interval or supervisory islanding detection process and enabling timely voltage/current support during unplanned transitions. Experimental results validate that the proposed method achieves stand-alone voltage support, stable grid-connected current injection under symmetrical/unsymmetrical grid-voltage sag and phase-jump disturbances, and unplanned bidirectional transitions.
comment: 10 pages, 9 figures
LEAF: A Learning-Enabled ADMM Framework for Accelerated Convex Optimization
We propose LEAF, a learning-enabled ADMM framework for accelerated convex optimization. The key idea is to approximate the Moreau envelope of the objective function using an Input Convex Neural Network (ICNN), resulting in a learned model that preserves convexity and smoothness. This leads to the proposed Moreau Envelope Learning ADMM (MEL-ADMM) and its splitting variant sMEL-ADMM. Unlike existing approaches that learn high-dimensional operators directly, LEAF learns a scalar-valued Moreau envelope, significantly reducing model complexity and improving data efficiency. The framework accommodates a broad class of convex problems with smooth and non-smooth objectives. By embedding convexity explicitly through the ICNN architecture, the proposed approach maintains high approximation accuracy while preserving key structural properties of the optimization problem. Both MEL-ADMM and sMEL-ADMM are developed with theoretical guarantees of convergence and feasibility under the learned model. Rigorous analysis shows that the proposed methods achieve convergence rates comparable to classical ADMM while reducing per-iteration computational cost. Numerical experiments demonstrate up to an order-of-magnitude speedup over state-of-the-art solvers while maintaining low optimality gaps
Not All Warm Starts Help: Benchmarking Primal-Dual Initializations for ACOPF Algorithms
Warm starts are widely used to accelerate AC optimal power flow (ACOPF) solves, but the impact of different initialization strategies has received limited systematic study, particularly for the primal-dual interior-point methods that dominate large-scale ACOPF algorithms. This paper benchmarks initialization strategies for ACOPF solved with the interior-point solver IPOPT on 19 PGLib-OPF instances (5 to 30,000 buses), testing all 15 non-empty subsets of the primal blocks $\{P_g, Q_g, V_m, V_a\}$ under oracle conditions and three DC-seeded combinations in a practical setting. The experiments show that most partial primal-plus-dual restarts increase solve time or reduce convergence reliability. Among the oracle primal-plus-dual (O-PD) configurations, only the complete restart reliably converges on every baseline-convergent case, reaching a $47.6\%$ median solve-time speedup. Twelve of the 14 partial O-PD combinations have negative median speedups, and several fail repeatedly on larger networks. Decomposing the dual into constraint and bound multipliers shows that \emph{coverage}, not the presence of duals per se, governs robustness: the full bound-multiplier vector reaches 90.7\% convergence and a $+26.8$\% median speedup, whereas block-matched coverage (oracle multipliers on some bounds, defaults on the rest) drops to 70.4\% and $-31.1$\%. Practical DC seeding sometimes helps the AC solve, but the benefit is no longer statistically significant once the DCOPF presolve cost is included in the end-to-end comparison ($p = 0.4171$). For learned warm-start methods, the results support the following target ordering: predict the full primal vector first; if only partial coverage is possible, prioritize voltage variables; and avoid partial or inconsistent dual predictions unless the primal estimate is nearly complete.
Game-Theoretic Area Coverage Control with Cooperative-Adversarial Multi-Agent Systems
We formulate a multi-agent area coverage control problem as a two-player zero-sum game between two agent groups with conflicting goals. Conventional coverage control allocates resources based on an environmental risk density field. In contrast, we generalize this metric by allowing a second group of adversarial agents to generate the spatial risk field. Coupled agent dynamics are linked through the area coverage metric, which functions as the game reward. This framework induces coupled gradient-descent-ascent controllers for the groups. Analysis of a low-dimensional case reveals a Hopf bifurcation dictated by the ratio of the groups' control gains. In the regime dominated by adversarial agents, the system is driven into a periodic chase-evasion cycle. In the regime dominated by ordinary agents, the system converges to a fixed configuration. Numerical simulations validate these theoretical insights. Finally, we characterize the Nash equilibrium conditions. Under this equilibrium, ordinary agents converge to a generalized centroidal Voronoi tessellation, whereas adversarial agents settle at their corresponding equilibrium centroids.
comment: This work has been submitted to IFAC for possible publication
Submodular Optimization with Applications to Decision and Control
Submodular set functions, characterized by the diminishing-returns property, provide a unifying combinatorial framework for many subset-selection problems in decision and control. Although exact maximization is NP-hard in general, the structural properties of submodular functions enable simple greedy algorithms that achieve constant-factor approximation guarantees for monotone objectives, with randomized greedy-based variants extending such guarantees to the non-monotone case. This survey reviews the theory, algorithms, and applications of submodular optimization with a focus on systems and control. We cover the structural properties of submodular functions, including curvature and the submodularity ratio, the constraint families that arise in practice (matroids, knapsack, and $p$-systems), and the main approximation algorithms for monotone and non-monotone submodular maximization, with up-to-date approximation ratios and hardness results. We then survey applications across sensor scheduling, multi-agent coordination, robust submodular optimization, leader-follower systems, distributed submodular optimization, game theory, system theory, resource allocation, social networks, and informative path planning. The survey emphasizes practically implementable greedy-based algorithms and instance-dependent refinements via curvature and the submodularity ratio. We close with observations on canonical control-theoretic objectives: certain functionals are submodular (the log-determinant and rank of the controllability Gramian, and the log-determinant of the Kalman filter information matrix), whereas closely related objectives fail to be sub- or supermodular (the steady-state Kalman filter error covariance, and the average control energy obtained from the inverse Gramian). We also highlight the cross-cutting open directions that follow.
An Algebraic State Observer for a Self-Sensing Active Magnetic Bearing System
The problem of designing a globally stable observer for a self-sensing active magnetic bearing system assuming only measurements of currents and voltages is addressed in this paper. Towards this end, we first design a radically different, high performance, state observer, which is obtained invoking novel techniques. Indeed, our objective is to obtain an algebraic relation between the unmeasurable part of the state and filtered versions of the systems inputs and outputs, which holds for all times. Then, using this algebraic observer, we propose a robust asymptotic version of the observer. Simulation results that illustrate the performance of the observer are also presented.
comment: 7 pages, 9 figures, submitted to International Journal of Adaptive Control and Signal Processing
Nonlinear Estimator: Dual Bayesian Affine Estimators for Parameter Learning
This paper presents a nonlinear parameter estimator for Wiener-type state-space models obtained as a fixed-point architecture that couples two affine minimum mean-squared error (MMSE) estimators: one for the unknown parameters and one for latent variables. The architecture retains the functional structure of the optimal affine MMSE parameter estimator while incorporating Dynamic Basis Statistics (DBS) estimates that summarize nonlinear basis-function evaluations. Two DBS construction strategies are developed, leading to two nonlinear estimator frameworks. The dual basis-parameter estimator combines an affine basis estimator with the affine parameter estimator, whereas the dual state-parameter estimator first computes affine state estimates and their covariances, then maps these state-estimate statistics through a Gaussian DBS operator to obtain DBS estimates. Both dual estimators admit fixed-point characterizations that alternate between estimating each component using the updated prior of the other, obtained from that component's plug-in estimate statistics from the previous iteration. The efficacy of the proposed methods is examined via extensive Monte Carlo experiments, showing that the dual basis-parameter estimator attains parameter mean-squared errors comparable to those of the purely affine parameter estimator, while the dual state-parameter estimator achieves the lowest parameter mean-squared error, outperforming both the dual basis-parameter and purely affine parameter estimators, as well as sequential Monte Carlo variants of classical Particle Gibbs and Expectation-Maximization schemes.
comment: 32 pages, 9 figures
A constrained symbolic regression approach for Lyapunov function discovery
In this paper, we consider the data-driven discovery of Lyapunov functions for autonomous dynamical systems. We represent the Lyapunov function as an expression tree of fixed depth and formulate the Lyapunov discovery task as a constrained self-supervised symbolic regression problem. The constraints model the output of the Lyapunov function for a given input as well as the Lyapunov stability conditions. This modeling approach makes no a priori assumptions about the functional form of the Lyapunov function, is inherently interpretable since the function is obtained in a symbolic form, and, in principle, can be applied to any continuous dynamical system. We also develop a tailored branch-and-bound-and-check solution approach to efficiently solve the resulting learning task. Applications to several case studies show the ability of the proposed approach to discover Lyapunov functions.
Geometric Time-Domain Identification of Three-Phase Load Equivalents from Terminal Measurements
This paper presents a geometric time-domain method for identifying three-phase load equivalents from instantaneous voltage and current measurements at the point of common coupling. Measured waveforms are interpreted as trajectories in Euclidean signal spaces, and load-equivalent parameters are recovered from the geometry of those trajectories. The method extends a previously published single-phase geometric identification formulation to three- and four-wire systems and places special emphasis on the three-wire case, where no neutral voltage is measured and the terminal data must satisfy coupled Kirchhoff constraints. The main advance over the earlier analytical formulation is a sampled-data implementation based on local time windows, normalized matrix equations, harmonic-projection derivative and primitive coordinates, explicit geometric identifiability tests, passivity constraints, and energy/Kirchhoff residuals. The method does not force a model when the measured trajectory lacks enough information; instead, it reports low-rank or ill-conditioned windows as low-confidence evidence. Numerical simulations with clean data, measurement noise, window-length sweeps, and sensor delay show that the method accurately identifies informative three-phase trajectories and exposes structurally degenerate cases such as pure single-frequency excitation for higher-order three-wire models. For a given admissible topology the identified circuit closes the instantaneous terminal energy balance of the measured load over the analysis window.
Notes on data-driven output-feedback control of linear MIMO systems
Recent works have approached the data-driven design of dynamic output-feedback controllers for discrete-time LTI systems by constructing non-minimal state vectors composed of past inputs and outputs. Depending on the system's complexity (order $n$, lag $\ell$ and number of outputs $p$), it was observed in several works that such an approach presents significant limitations. In particular, many works require to restrict the class of LTI systems to those satisfying the relation $p\ell=n$. In this note, we show how to address the general MIMO case (for which $p\ell\geq n$ in general) by constructing an alternative non-minimal state vector from data. Different from the existing literature, our method guarantees the satisfaction of certain rank conditions when the system is persistently excited, thereby facilitating the direct data-driven dynamic output-feedback control of MIMO systems by applying methods that were originally developed for the input-state data setting.
comment: updated IEEE copyright notice
Branch-Level Energy Localization in Three-Phase Loads: Resolving Indeterminacy in Time-Domain
This paper develops a branch-level energy-localization framework for three-phase loads. The instantaneous terminal power of an admissible lumped equivalent is decomposed uniquely as Joule dissipation plus magnetic and electric stored-energy rates, branch by branch. Three formal results are established: a Branch-Level Localization Theorem (uniqueness given an admissible topology); a Topology-Indeterminacy Theorem (multiple admissible topologies reproduce identical terminal data with distinct localizations); and a Generalized Energetic Duality Theorem that organizes classical electrical dualities (Norton-Thevenin, series--parallel, L vs C, R vs G) as restrictions to Linear Time Invariant (LTI) sinusoidal regimes of a single time-domain principle in which constant-parameter equivalence is replaced by time-varying parameters. The framework is exercised on six test cases including the de Leon--Cohen open-phase paradox, switched-resistive loads, three-wire delta-versus-wye-virtual indeterminacy, fluctuating-phase loads, and a four-wire nonlinear load with hysteretic, linear, and switched branches. The framework is positioned as complementary to IEEE Std. 1459, CPC, instantaneous p-q, and Fryze-Buchholz-Depenbrock: each answers a different question, and the apparent paradoxes vanish once the question is posed precisely.
Robust and efficient data-driven predictive control
We propose a robust and efficient data-driven predictive control (eDDPC) scheme which is more sample efficient (requires less offline data) compared to existing schemes, and is also computationally efficient. This scheme employs a recently proposed data-based representation of linear time-invariant (LTI) systems as a predictor. Such a representation serves as an alternative to Hankel-based predictors obtained from, e.g., the so-called fundamental lemma, and can be derived by exploiting the kernel structure of shallow Hankel matrices of data. This allows for application of our proposed scheme using very short (and potentially irregularly measured) noisy input-output data, the amount of which is independent of the prediction horizon. To account for measurement noise, we provide a novel result that quantifies the uncertainty between the true (unknown) restricted behavior of the system and the estimated one from noisy data. Furthermore, we show that the robust eDDPC scheme is recursively feasible and that the resulting closed-loop system is practically exponentially stable. Finally, we compare the performance of this scheme to existing ones on a case study of a four tank system.
comment: 17 pages, 3 figures
Chaos-Free Networks are Stable Recurrent Neural Networks
Gated Recurrent Neural Networks (RNNs) are widely used for nonlinear system identification due to their high accuracy, although they often exhibit complex, chaotic dynamics that are difficult to analyze. This paper investigates the system-theoretic properties of the Chaos-Free Network (CFN), an architecture originally proposed to eliminate the chaotic behavior found in standard gated RNNs. First, we formally prove that the CFN satisfies Input-to-State Stability (ISS) by design. However, we demonstrate that the CFN architecture does not intrinsically guarantee Incremental ISS (delta-ISS), as ensuring this property relies on specific parametric constraints. To address this, we introduce the Decoupled-Gate Network (DGN), a novel structural variant of the CFN that removes internal state connections in the gating mechanisms. Finally, we prove that the DGN unconditionally satisfies the delta-ISS property, providing an incrementally stable architecture for identifying nonlinear dynamical systems without requiring complex network training modifications. Numerical results confirm that the DGN maintains the modeling capabilities of standard architectures while adhering to these rigorous stability guarantees.
comment: Accepted for publication in IEEE Control Systems Letters (L-CSS). 6 pages, 2 figures
Artificial-reference tracking MPC with probabilistically validated performance on industrial embedded systems
Industrial embedded systems are typically used to execute simple control algorithms due to their low computational resources. Despite these limitations, the implementation of advanced control techniques such as Model Predictive Control (MPC) has been explored by the control community in recent years, typically considering simple linear formulations or explicit ones to facilitate the online computation of the control input. These simplifications often lack features and properties that are desirable in real-world environments. This article presents an efficient implementation for embedded systems of MPC for tracking with artificial reference, solved via a recently developed structure-exploiting ADMM-based algorithm. This formulation is tailored to a wide range of applications by incorporating essential practical features at a small computational cost, including integration with an offset-free scheme, back-off parameters that enable constraint tightening, and soft constraints that preserve feasibility under disturbances or plant-model mismatch. This is accompanied with a framework for probabilistic performance validation of the closed-loop system over long-term operation. The applicability of the approach is illustrated on a Programmable Logic Controller (PLC), incorporated in a hardware-in-the-loop setup to control a nonlinear continuous stirred-tank reactor. The behavior of the closed-loop system is probabilistically validated with respect to constraint violations and the number of iterations required at each time step by the MPC optimization algorithm.
comment: 15 pages, 25 figures
Exposing Barriers to Flexibility Aggregation in Unbalanced Distribution Networks
The increasing integration of distributed energy resources (DER) offers new opportunities for distribution system operators (DSO) to improve network operation through flexibility services. To utilise flexible resources, various DER flexibility aggregation methods have been proposed, such as the concept of aggregated P-Q flexibility areas. Yet, many existing studies assume perfect coordination among DER and rely on single-phase power flow analysis, thus overlooking barriers to flexibility aggregation in real unbalanced systems. To quantify the impact of these barriers, this paper proposes a three-phase optimal power flow (OPF) framework for P-Q flexibility assessment, implemented as an open-source Julia tool 3FlexAnalyser.jl. The framework explicitly accounts for voltage unbalance and imperfect coordination among DER in low voltage (LV) distribution networks. Simulations on an illustrative 5-bus system and a real 221-bus LV network in the UK reveal that over 30% of the theoretical aggregated flexibility potential can be lost due to phase unbalance and lack of coordination across phases. These findings highlight the need for improved flexibility aggregation tools applicable to real unbalanced distribution networks.
Autonomous Air-Ground Vehicle Operations Optimization in Hazardous Environments: A Multi-Armed Bandit Approach
Hazardous environments such as chemical spills, radiological zones, and bio-contaminated sites pose significant threats to human safety and public infrastructure. Rapid and reliable hazard mitigation in these settings often unsafe for humans, calling for autonomous systems that can adaptively sense and respond to evolving risks. This paper presents a decision-making framework for autonomous vehicle dispatch in hazardous environments with uncertain and evolving risk levels. The system integrates a Bayesian Upper Confidence Bound (BUCB) sensing strategy with task-specific vehicle routing problems with profits (VRPP), enabling adaptive coordination of unmanned aerial vehicles (UAVs) for hazard sensing and unmanned ground vehicles (UGVs) for cleaning. Using VRPP allows selective site visits under resource constraints by assigning each site a visit value that reflects sensing or cleaning priorities. Site-level hazard beliefs are maintained through a time-weighted Bayesian update. BUCB scores guide UAV routing to balance exploration and exploitation under uncertainty, while UGV routes are optimized to maximize expected hazard reduction under resource constraints. Simulation results demonstrate that our framework reduces the number of dispatch cycles to resolve hazards by around 30% on average compared to uninformed baseline dispatch strategies, underscoring the value of uncertainty-aware vehicle dispatch for reliable hazard mitigation.
A two-disk approach to the synthesis of coherent passive equalizers for linear quantum systems
The coherent equalization problem consists in designing a quantum system acting as a mean-square near-optimal filter for a given quantum communication channel. The paper develops an improved method for the synthesis of transfer functions for such equalizing filters, based on a linear quantum system model of the channel. The method draws on a connection with the two-disk problem of ${H}_{\infty}$ control for classical (i.e., non-quantum) linear uncertain systems. Compared with the previous methods, the proposed method applies to a broader class of linear quantum communication channels.
comment: 20 pages, 9 figures
Optimal Finite-Horizon LQR Control for Traffic Flow via Variable Speed Limits
This article presents a finite-horizon linear quadratic regulator for the control of the first-order Lighthill-Whitham-Richards traffic model with a triangular fundamental diagram. The in-domain control action is realized through variable speed limits implemented as a source term in the governing hyperbolic partial differential equation. Unlike prior studies on infinite-horizon formulations, this article develops a finite-horizon LQR framework, deriving a space and time varying state feedback function for hyperbolic PDEs. The solution to the finite time optimal control problem relies on the solution of another PDE, called the Riccati PDE. The resulting nonlinear Riccati PDE is solved analytically via the parametric method of characteristics. The Riccati PDE solution is a function of both time and space, as well as the traffic regime. A sensitivity analysis demonstrates the effects of the LQR parameters for both the infinite and finite time horizon problem in different traffic situations, while siulations validate the finite-horizon LQR's ability to guarentee finite-time convergence. Comapred to the infinite-horizon LQR, the proposed approach achieves significantly improved control performance across various scenarios, making it particularly suitable for time-sensitive traffic management applications.
comment: 12 pages, 24 figures, submitted to IEEE Transactions on Control Systems Technology
Resilience metrics to guide back-up investments in the power system during extreme weather
Security of supply is a common and important concern when integrating renewables in net-zero power systems. Extreme weather affects both demand and supply leading to power system stress; in Europe this stress spreads continentally beyond the meteorological root cause. We use an approach based on shadow prices to identify periods of elevated stress called system-defining events and analyse their impact on the power system. By classifying different types of system-defining events, we identify challenges to power system operation and planning. Crucially, we find the need for sufficient resilience back-up (power) capacities whose financial viability is precarious due to weather variability and weather-induced risk. Furthermore, we disentangle short- and long-term resilience challenges (from multi-day to annual scale) with distinct metrics and stress tests to incorporate both into future energy modelling assessments. Our methodology and implementation in an open energy system model (PyPSA-Eur) can be re-applied to other systems and help researchers and policymakers in building more resilient and adequate energy systems.
Emergence-as-Code as a Foundation for Self-Governing Reliable Systems
Service-level objective (SLO)-as-code tools make per-service reliability declarative, but users experience journeys: end-to-end executions whose availability and tail latency emerge from topology, routing, redundancy, timeouts/fallbacks, shared failure domains, and tail amplification. Journey objectives are therefore often maintained outside code and drift away from the effective runtime graph. We propose Emergence-as-Code (EmaC), a declarative contract that compiles journey-level SLI bounds and governance artifacts for declared SLOs from intent and evidence. An EmaC specification defines a typed journey expression, leaf bindings to atomic SLOs and telemetry, failure-domain assumptions, and guarded actions. Model Discovery proposes evidence-backed deltas for edges, branch probabilities, redundancy groups, and failure-domain hypotheses; each delta carries provenance and confidence. The compiler derives optimistic and pessimistic journey bounds and emits reviewable governance artifacts. An executable checkout replay shows that local SLOs can remain green while evidence-backed discovery changes the failure-domain model, collapses the pessimistic payment-race bound, and changes the rollout decision from pass to fail or review.
Toward Operationalizing Rasmussen: Drift Observability on the Simplex for Evolving Systems
Software operations increasingly rely on SLOs, traces, deployment specifications, and change events, yet dashboards and thresholding practices often expose share-like operational signals as separate scalar panels or baseline distances. This can create false alarms under benign redistribution and miss movement toward policy boundaries. Rasmussen's dynamic safety model motivates drift under competing pressures, but operationalizing it for software is difficult because relevant state variables (remaining margin, engineering effort, and risk/impact) are often compositional and their parts evolve. We formulate an automated, artifact-derived drift-monitor design that maps changing software artifacts into a stable compositional monitoring state: it extracts a current part inventory and policy constraints, maps telemetry to a positive composition, stabilizes splits, merges, and renames through lineage-aware canonical groups, and analyzes boundary-directed drift in log-ratio coordinates. The proposed monitor would report drift direction, step-to-boundary, balance-level attribution, and model-health indicators under architectural churn. We specify the approach, identify its zero/noise/lineage assumptions, and report a reproducible synthetic sanity check of boundary-aware drift and controlled part churn.
Deep reinforcement learning for process design: Review and perspective
The transformation towards renewable energy and feedstock supply in the chemical industry requires new conceptual process design approaches. Recently, breakthroughs in artificial intelligence offer opportunities to accelerate this transition. Specifically, deep reinforcement learning, a subclass of machine learning, has shown the potential to solve complex decision-making problems and aid sustainable process design. We survey state-of-the-art research in reinforcement learning for process design through three major elements: (i) information representation, (ii) agent architecture, and (iii) environment and reward. Moreover, we discuss perspectives on underlying challenges and promising future works to unfold the full potential of reinforcement learning for process design in chemical engineering.
Tunable Real-Time Safety Filters via Set-Based Control Barrier Functions
Safety filters for industrial constrained systems are required to combine certified constraint satisfaction, predictable online computation, and a transparent tuning interface. Existing set-based filters are based on a well-established control invariant set design that scales favorably with state and input constraints, but typically intervene only at the set boundary. Control barrier function (CBF)-based filters, by contrast, provide tunable intervention but require a scalar barrier construction. This paper proposes a set-based CBF safety filter that turns a convex control invariant set directly into a tunable barrier via its Minkowski functional. The resulting filter is formulated as a single-level quadratic program (QP) in which one class-$\mathcal{K}^e$ parameter sets the intervention aggressiveness. Explicit convex formulations are derived for polytopic, zonotopic, and MPC-based invariant sets. Under standard bounded-disturbance assumptions, the resulting safety filter guarantees constraint satisfaction and asymptotic recovery into the invariant set. For tight real-time budgets, a learning-based approximation enables online acceleration, while the formal safety guarantees remain tied to the exact formulation. The method is validated in numerical studies and on a permanent-magnet synchronous motor drive, where an explicit QP implementation evaluates within a 150 microseconds sampling window and has a worst-case execution time of 28.04 microseconds.
Bellman Residual Minimization for Control: Geometry, Stationarity, and Convergence
Markov decision problems are most commonly solved via dynamic programming. Another approach is Bellman residual minimization, which directly minimizes the squared Bellman residual objective function. However, compared to dynamic programming, this approach has received relatively less attention, mainly because it is often less efficient in practice and can be more difficult to extend to model-free settings such as reinforcement learning. Nonetheless, Bellman residual minimization has several advantages that make it worth investigating, such as more stable convergence with function approximation for value functions. While Bellman residual methods for policy evaluation have been widely studied, methods for policy optimization (control tasks) have been scarcely explored. In this paper, we establish foundational results for the control Bellman residual minimization for policy optimization.
Online Learning for Supervisory Switching Control
We study supervisory switching control for partially-observed linear dynamical systems. The objective is to identify and deploy a suitable controller for the unknown system by periodically selecting among a collection of $N$ candidate controllers, some of which may destabilize the underlying system. While classical estimator-based supervisory control guarantees asymptotic stability, it lacks quantitative finite-time performance bounds. Conversely, current non-asymptotic methods in both online learning and system identification require restrictive assumptions that are incompatible in a control setting, such as system stability, which preclude testing potentially unstable controllers. To bridge this gap, we propose a novel, non-asymptotic analysis of supervisory control that adapts multi-armed bandit algorithms to a control-theoretic setting. The proposed data-driven algorithm evaluates candidate controllers via scoring criteria that leverage system observability to isolate the effects of state history, enabling both detection of destabilizing controllers and accurate system identification. We present two algorithmic variants with dimension-free, finite-time guarantees, where each identifies the matching controller in $O(N \log^2 N)$ steps, while simultaneously achieving finite $L_2$-gain with respect to system disturbances.
Robotics
Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis
Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $π_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.
comment: 13 pages, 9 figures,
Geometry-Aware Fisheye-LiDAR Fusion for Robust 3D Object Detection in Low-Overlap Setups
As autonomous systems expand from capital-intensive robotaxis to cost-sensitive logistics, sensor configurations are increasingly optimized for coverage-per-cost. A prevalent sparse-view setup utilizes dual-fisheye cameras with a roof-mounted LiDAR, introducing severe geometric challenges: extreme radial distortion, minimal overlap, and misalignment between spherical projections and rectilinear grids. BEV fusion algorithms typically force image and point cloud modalities into unified Cartesian grids early in the pipeline, causing significant feature distortion and information loss for wide-view fisheye cameras. To address this, we propose a Geometry-Aware Hybrid Fusion (GA-HF) framework that explicitly accounts for fisheye geometry and BEV feature distortion, where fisheye features are lifted into a polar BEV grid via a Distortion-Aware Lift-Splat-Shoot (LSS) module to preserve native angular density, while LiDAR features are processed in native Cartesian space for metric fidelity of bounding box regression. To bridge these heterogeneous streams, we introduce a Dual-Attention Warping Correction module that applies spatial and channel attention to the warped camera features before fusion, explicitly suppressing artifacts in low-quality peripheral regions while enhancing high-quality semantic cues. GA-HF is evaluated on three benchmarks: KITTI-360, Dur360BEV, and Fisheye3DOD datasets. To the best of our knowledge, it is the first approach to explore LiDAR-fisheye camera fusion. On KITTI-360, GA-HF improves NDS by 4.2% over Cartesian baselines; on Dur360BEV, it surpasses both LiDAR-only and BEVFusion, while significantly reducing orientation error despite the geometric distortions; on Fisheye3DOD, it attains the highest detection score among all fusion methods.
comment: 8 pages, 4 figures, submitted to RA-L
Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video
Human manipulation videos are a convenient and intuitive source for robot learning. However, directly transferring human dexterity to robots remains challenging due to perception errors and embodiment gap. To address this, we introduce Video2Sim2Real, a full-stack framework for autonomous skill acquisition from a single human manipulation video. Our framework first uses off-the-shelf foundation models to reconstruct a simulator-ready digital twin and extract robot and object motion priors. Rather than treating the extracted robot motion as a reliable reference throughout execution, our key idea is to recover and leverage the most fundamental sources of supervision from the demonstrated skill: We identify object-centric keyframes to optimize the corresponding robot configurations using object information from the simulator, and use these configurations as anchors that refine the robot motion such that it ultimately has the desired impact on the environment. To bridge the remaining sim-to-real gap, we introduce a sim-to-real strategy that decouples robustness to noisy and incomplete perception from variations in hand-object interaction dynamics. Specifically, we learn to recalibrate robot configurations from noisy real-world point clouds via IL, and leverage residual RL to perform local finger-level adaptations to ensure for robust and effective interactions. Finally, a collision-aware motion planning module enables spatial generalization to novel object configurations. Across several everyday manipulation tasks, Video2Sim2Real improves simulated task success, safety, and trajectory coherence over numerous baselines, and achieves better sim-to-real transfer than existing techniques. These results demonstrate a promising path toward autonomous dexterous skill acquisition from human videos.
comment: Website: https://video2sim2real.github.io/
Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks
Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.
RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation
Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves real-world occluded manipulation success rates by $26.7$ percentage points over the strongest implicit visuo-tactile baseline, suggesting its improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
comment: 20 pages, 7 figures
Guided Discovery of New Behaviors using Diffusion Policies
Diffusion models have become a powerful tool for generative modeling in robotics, with diffusion policies excelling at modeling multimodal action-trajectory distributions. However, when demonstrations are limited, standard sampling often reproduces dominant behaviors while neglecting valid but rare modes, limiting the discovery of novel solutions. Existing approaches, such as guidance methods or combining reinforcement learning with diffusion, either push samples into infeasible regions or struggle to escape local minima, failing to systematically uncover diverse behaviors. To address these challenges, we propose a framework that combines Feynman-Kac correctors with a novel guiding potential that systematically guides diffusion policy samples towards promising yet underrepresented samples. These trajectories are refined using sampling-based trajectory optimization and reincorporated into the training set to retrain the diffusion policy. Our method effectively mines and repairs novel trajectories, enabling the systematic discovery of diverse and executable behaviors. We demonstrate the effectiveness of our framework across a range of manipulation environments, consistently discovering new behaviors.
comment: Preprint. Supplementary video: https://youtu.be/T7MUvMA67VM
Safe, Fluent and Acceptable Motion Generation and Execution for Human--Robot Interaction in Manufacturing Environments
Robots operating in human environments must not only ensure physical safety but also exhibit behaviors that are understandable, fluent, and acceptable to human partners. This paper investigates motion generation strategies that combine safety guarantees with interaction quality considerations, such as motion smoothness and human comfort. While the design of robots capable of ensuring safety in shared human-robot environments has enabled closer and more advanced forms of interaction, these new proximity-based tasks require moving beyond purely technical considerations. In particular, robot behavior must also be addressed from psycho-cognitive and social perspectives. In this context, we argue for the relevance of integrating social-aware motion control into robotic systems. First, we identify the motion parameters that influence human perception and operator experience. Then, we implement a Model Predictive Control (MPC) framework that generates four distinct socially-informed robot behaviors. Finally, we conduct a user study to evaluate and validate these behaviors and assess their social impact on non-expert participants. The results demonstrate that variations in robot behavior significantly affect the perceived social acceptability of the system. These findings highlight the importance of incorporating human-centered considerations into motion generation strategies for robots operating in shared environments.
Systems-Level Planning and Coordination of Truck-Drone Collaborative Delivery Networks
Urban last-mile parcel delivery increasingly relies on heterogeneous fleets whose performance depends on timely coordination, reliable communication, and scalable control. Truck-drone collaboration has emerged as a networked cyber-physical delivery paradigm that combines the payload capacity and range efficiency of trucks with the agility of drones in congested or access-limited urban environments. This paper proposes a layered planning and coordination framework that structures truck-drone collaborative delivery (TDCD) from a systems and control perspective. The framework consists of five interrelated layers: spatial-demand alignment, collaborative delivery configuration, resource and workflow orchestration, performance evaluation, and scalability analysis, providing a unified view of coordination, control, and system-level performance in networked delivery operations. The proposed framework is evaluated using a realistic urban last-mile delivery scenario derived from the 2021 Amazon Last Mile Routing Research Challenge dataset. The case study demonstrates how coordinated truck-drone operation, enabled by structured task orchestration and inter-agent synchronization, improves end-to-end system efficiency under operational constraints. Results show a 42.4% reduction in total delivery time and a 44.2% reduction in energy consumption compared to a conventional truck-only delivery model. The scalability analysis further highlights how coordination gains persist as system size increases, and shows the importance of efficient control and communication in heterogeneous delivery networks.
comment: 6 pages, 4 figures, Accepted to 2026 IEEE HPSR on Network Architectures and Intelligence for Smart Mobility and Autonomous Systems (TRAVERSAL)
Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation
World action models inherit the predictive capability of world models, enabling action generation to be guided by anticipated future observations. However, they rely primarily on vision and often fail in contact-rich manipulation, where critical cues arise from physical interaction. In this paper, we propose Dream-Tac, a unified Tactile-World Action Model that jointly models actions, future visual observations, and tactile dynamics. Specifically, Dream-Tac introduces (i) contact-gated visuotactile fusion to selectively integrate tactile signals and (ii) a contact-aware attention bias to better regulate cross-modal interactions during manipulation. To support real-time deployment, we further design a dual-level acceleration strategy, reformulating the contact-aware bias to preserve the fused attention path during training and introducing cache-based diffusion acceleration at inference, achieving up to 2.9$\times$ faster training and 1.8$\times$ faster inference. Across six contact-rich manipulation tasks, Dream-Tac improves action accuracy by 31.7\% on average, demonstrating the effectiveness of unified visuotactile world modeling.Code is available at https://github.com/LYFCLOUDFAN/Dream-Tac.
comment: 16 pages,13 figures
IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking
Simulation plays a key role in automated robotics research supported by large language models (LLMs). However, existing simulators often require custom code or complex interfaces, creating a barrier to rapid prototyping and automated algorithm development. To this end, we propose the Intelligent Robot Simulator (IR-SIM), a lightweight skill-native navigation simulator designed for rapid scenario construction, benchmarking, and robot learning. In IR-SIM, scenarios are entirely defined by YAML configuration files that specify mobile robot kinematics, geometric collision checking, LiDAR sensing, visualization, and behavior modules. This design makes robotic simulation fully describable and reproducible, allowing scenarios to be generated and modified from text prompts through the proposed IR-SIM agent skills. The resulting scenarios can be used for automated benchmarking of navigation algorithms and for automated generation of training data for learning methods. Furthermore, IR-SIM provides bridges to high fidelity simulators and real world deployment, allowing users to validate their algorithms in more realistic settings after prototyping without extra coding. The experiments showcase the convenience and versatility of IR-SIM in multiple tasks: constructing navigation scenarios from natural language, training a collision avoidance policy, benchmarking social navigation policies, and bridging to high fidelity simulators and real world deployment. The project website is available at https://github.com/hanruihua/ir-sim.
comment: 12 pages, 6 figures, project website: https://github.com/hanruihua/ir-sim
Real-Time and Accurate Collision-Free Teleoperation via Differentiable Constraint-Based Trajectory Planning ICRA2026
In teleoperation, the human operator typically controls only the end-effector pose, which often leads to self-collisions of the manipulator and collisions with environmental obstacles, since joints and links are not controlled individually. A common strategy to mitigate this issue is to enhance the operator's input using optimal-control-based trajectory planning. As derivative-based solvers require differentiable constraints, existing approaches either approximate robots and obstacles with spheres, reducing geometric accuracy, or approximate derivatives, degrading convergence and increasing computation times. We address these limitations by adapting a recent formulation of differentiable collision-avoidance constraints, based on duality in convex optimization, to the teleoperation setting. The robot is approximated with capsules and the environment with polytopes. We compare the resulting trajectory planning method against state-of-the-art techniques in simulation with varying numbers of obstacles and evaluate it on a UR5e manipulator in a real-world teleoperation test. Results show that our approach achieves lower computation times while enabling more accurate obstacle modeling, leading to smoother and collision-free end-effector teleoperation.
comment: 8 pages, 4 figures, accepted at ICRA2026
Hybrid Neural Network and Conventional Controller Approach for Robust Control of Highly Unstable Systems: Application to Tilt-Rotor Control
Multirotors are widely used in applications ranging from surveillance to precision agriculture, yet conventional designs remain limited by their under-actuation. Tilt-rotor configurations overcome this limitation by enabling full actuation. This paper investigates neural-network-based control strategies for a fully actuated tilt-rotor system with four thrust-vectoring inputs. Our work is structured in two parts. First, we deliberately present a negative result by evaluating a direct input-output control approach. In this method, multilayer perceptrons (MLPs), long short-term memory (LSTM) networks, and transformer models are trained to map system states and their desired values directly to control signals. We show that this strategy fails to stabilize the system, highlighting the inherent difficulty of applying direct input-output learning to highly unstable plants. Second, as the main contribution, we propose a neural-network-enhanced sliding mode controller (SMC). The method decomposes the system dynamics into input-independent and input-dependent components, with the former learned from a small dataset using lightweight networks, thereby reducing real-time computational demands. Moreover, the proposed method can be trained using flight logs collected from low-performance controllers, and the resulting dynamic model learned from real-world data can be used in simulation. We further compare MLP- and LSTM-based implementations under model uncertainties and external disturbances, demonstrating the robustness and effectiveness of the proposed approach; in particular, the controller with the LSTM plant dynamics predictor achieves superior performance to its MLP-based counterpart while also exhibiting lower runtime.
comment: Proceedings of the 13th RSI International Conference on Robotics and Mechatronics (ICRoM 2025)
PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback
Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and generative AI. However, configuring complex environmental force fields still relies entirely on manual expert intervention, creating a severe bottleneck for large-scale simulation data generation. Existing automated methods primarily focus on material optimization and exhibit severe modality gaps and technical flaws when applied to the vastly more complex force field optimization space: naive Large Language Models (LLMs) lack underlying simulation feedback, causing severe physical inaccuracies, while traditional Score Distillation Sampling (SDS) suffers from sluggish gradients, local optima entrapment, and a mathematical inability to dynamically switch discrete force fields. To address this, we propose PhysAgent, the first simulator-in-the-loop multi-agent framework that leverages multimodal inputs for automated, physically grounded 4D synthesis. By decoupling intrinsic materials from extrinsic dynamics, PhysAgent utilizes a Semantic Agent equipped with an externalized Force Field Skill module to master simulation rules and generate valid initializations. Subsequently, the Refine Agents, driven by Trajectory-Grounded Multi-Agent Feedback, leverage vision foundation models to extract dense point trajectories from rendered frames. By converting these explicit motion trajectories into structured textual descriptors, the agent harnesses LLM commonsense reasoning to execute zero-shot macroscopic leaps, effectively escaping local optima and dynamically switching discrete force fields. Extensive experiments demonstrate that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts, significantly outperforming existing baselines in both generation diversity and physical accuracy.
Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole-Fisheye Cameras ICRA 2026
Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-coverage field of view (FOV), yet their potential remains underleveraged in 3D object detection. Severe radial distortion challenges most BEV detectors by violating the fundamental assumption of uniform sampling. To bridge this gap, we propose Distortion-Aware PETR (DAPETR), a projection-free detector tailored for mixed pinhole-fisheye camera setups. DAPETR incorporates two key learned-adaptive modules: a unified distortion-aware positional embedding that harmonizes positional encodings for image representations with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. In our experiments on a converted KITTI-360 benchmark, we systematically compare our learned adaptive approach against PETR in polar coordinates (PolarPETR). We find that while both methods improve over the baseline, our learned modules achieve superior performance. Crucially, we uncover a negative interaction when combining both strategies, revealing that learned adaptation and explicit geometric reparameterization can conflict. Our final DAPETR model significantly advances the research and benchmark for fisheye BEV detection, providing critical insights into effective distortion-aware 3D perception design other than image rectification.
comment: 8 pages, 5 figures, accepted at ICRA 2026
Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language
Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial information ("I left my backpack on the table") that reference parts of the world beyond their perceptual field of view. Traditional metric-semantic mapping ignores this signal, while off-the-shelf multimodal models remain limited in 3D spatial reasoning and are not directly amenable to fusion with other sensor modalities. To convert language observations into a calibrated spatial distribution, we train a Language Sensor Model (LSM) that maps each utterance and its scene-graph context to a multimodal distribution, with mixture weights encoding referential ambiguity (e.g., "which table") and component covariances encoding spatial uncertainty (e.g., where "on the table" the target lies). We then introduce VL-Map (Vision-Language Metric-Semantic Mapping), a probabilistic framework that treats these language predictions as stochastic observations and fuses them with onboard perception within a unified belief map. On the VLA-3D benchmark as well as on a real-world mobile robot, LSM is the only language predictor whose covariance estimates remain within the calibrated regime; fused into VL-Map, it leads to more accurate predictions of the target object location (~70% more probability mass on the true target compared to the strongest foundation-model baseline).
comment: 18 pages, 7 figures, 3 tables
Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation
Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within a single denoising process. The resulting velocity field must simultaneously encode scene information and generate precise trajectories, increasing learning complexity and limiting performance on tasks demanding precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework performing flow matching in a deliberately shaped latent space. By absorbing scene understanding into an observation-conditioned CVAE encoder, LDP concentrates the conditional distribution of each observation. Consequently, the flow model avoids implicitly resolving scene-dependent structures; instead, it generates within a pre-concentrated distribution featuring a smoother velocity field, simplifying learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP trains with per-token diffusion forcing and employs staircase inference sampling to resolve the resulting distributional mismatch. We also propose reconstruction FID (rFID) as a lightweight proxy predicting downstream task success solely from latent space statistics. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.
PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning
To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single-object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, PhysGraph reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer, both of which are demonstrated in our experiments.
FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning
Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
comment: Project page: https://fibertune.github.io/
HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning
Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, shaping rewards, and tuning hyperparameters require substantial expert effort, making RL workflows costly and difficult to scale. We introduce HARBOR, an agentic framework that frames robot RL automation as a harness-engineering problem: given a simulator codebase and a task specification, it automates the workflow from environment setup to policy training in simulation. HARBOR decomposes such high-level objectives into bounded stages executed by specialized agents through standardized commands, persistent artifacts, executable gates, and reusable knowledge, and scales iteration via decentralized parallel trials and experience learning across runs. We evaluate HARBOR across 6 benchmarks and 16 tasks in total, spanning manipulation, locomotion, and bimanual dexterous control. We demonstrate that HARBOR automates the simulation RL workflow end-to-end, designs rewards, tunes algorithms to match or improve over default configurations, and reduces engineering effort at practical token and wall-clock cost; the resulting policies can also be transferred to real robots.
Real-IKEA: Physical Fidelity is the Prerequisite for Robust Manipulation
Robotic manipulation robustness often founders on the physics gap between simplified simulations and the resistance-laden real world. In this work, we emphasize that physical realism in articulated interaction is an important ingredient for robust policy learning. We present Real-IKEA, a dataset and simulation framework designed with physical accuracy as a first-class goal. Real-IKEA provides 1,079 articulated asset configurations, derived from 83 authentic IKEA handles and knobs processed through a meticulous six-step physical workflow. For contact-geometry accuracy, we introduce a bidirectional surface-deviation metric to quantify collision meshes. For dynamics realism, we establish resistance-calibrated configurations that vary damping and friction. Crucially, we demonstrate through a Reinforcement Learning (RL) policy that high-fidelity assets enable the discovery of robust "hooking" and "levering" strategies that prioritize mechanical advantage over fragile friction-pulling. Together, these results position Real-IKEA as a critical benchmark for developing manipulation policies capable of human-level robustness in articulated object tasks.
FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation
Force signals provide critical interaction cues for contact-rich robotic manipulation. However, existing methods mostly use force as an additional observation modality, without fully exploiting its role in modeling future interaction dynamics or guiding execution-time feedback correction. In this paper, we propose FAWAM, a force-aware world action model that incorporates force information at three levels: perception, prediction, and closed-loop execution. FAWAM first encodes historical 6-axis force/torque signals to modulate action generation, then jointly predicts future actions and end-effector wrenches to explicitly model contact evolution. It further introduces a residual correction module that uses the predicted wrench trajectory as an execution-time reference to refine actions online based on real-time force feedback. Real-world experiments across multiple contact-rich tasks show that FAWAM improves the average success rate by 36.25% over vision-only baselines and 21.25% over existing force-aware baselines, demonstrating the effectiveness of our force-aware framework for robust contact-rich manipulation.
OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation
Recent progress in robot manipulation has been largely driven by learning from large-scale demonstrations. For humanoid robot loco-manipulation tasks, however, existing data sources force an unsatisfying tradeoff between trajectory quality and scalability. Real-world teleoperation provides the highest-quality trajectories but requires dedicated physical space and time-consuming scene resets. Simulation offers an alternative way out of this dilemma: it can produce clean, embodiment-aligned data at scale without any physical hardware. In this paper, we propose OASIS, a simulation-data-driven framework for humanoid loco-manipulation. OASIS automatically reconstructs realistic object assets from real-world images using a 3D generative model. Based on these assets, trajectories are first collected through teleoperation in simulation, and then augmented under diverse domain randomizations in a post-processing stage. With the resulting simulation data, we further design a hierarchical visuomotor policy for humanoid loco-manipulation. Extensive experiments on the real humanoid robot show that, under zero-shot deployment, the policy trained on our simulation data achieves higher success rates on most tasks than that trained on real-robot teleoperation data, owing largely to the broad lighting and environmental variations covered by our simulation rendering, which real-robot data fails to capture. The project page is available at https://oasis-humanoid.github.io/.
comment: Project Page: https://oasis-humanoid.github.io/
When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA
Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain. We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination. We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.
comment: 16 pages, 4 figures, 4 tables
Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning
Unmanned aerial vehicles (UAVs) are increasingly being deployed in logistics, service robotics, and other real-world applications, creating a growing demand for autonomous payload acquisition and delivery. Existing approaches typically assume pre-attached payloads or rely on specialized grippers, leaving versatile end-to-end aerial delivery largely unresolved, where different payloads induce highly variable flight dynamics, requiring a single policy to adapt online without manual calibration or explicit system identification. To this end, we study \textbf{A}utonomous \textbf{A}erial Manipulation via \textbf{Co}ntextual \textbf{Co}ntrastive Meta Reinforcement Learning (\textbf{\textit{Aco2}}), a fully autonomous aerial delivery setting in which a quadrotor equipped with a lightweight hook continuously picks up, transports, and delivers diverse handle-equipped objects between randomized locations, all without human intervention. First, we design a contextual observation encoder that infers a compact latent context from recent interaction history, enabling the policy to adapt online to payload-dependent dynamics. To further improve the quality of this context, we introduce a contrastive objective that structures the context embedding around task-relevant variations, improving generalization across diverse payloads without requiring explicit system identification. Trained entirely in simulation with extensive domain randomization, \textit{Aco2} can be directly deployed on a physical quadrotor without real-world fine-tuning.
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation
Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.
Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data
Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies (VLAs) is surprisingly difficult. The root cause is a two-fold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once -- the learning curve is steep and the rich generalizations learned during pretraining tend to degrade rather than transfer. We argue that this gap can be bridged gradually with the right intermediate data. We introduce \emph{embodied trajectory-coupled (ETC) data} -- vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning. Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied visual-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel visual-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is the key to transferring VLM generalization into robust, deployable robot policies.
Towards End to End Motion Planning and Execution for Autonomous Underwater Vehicles Using Reinforcement Learning
Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for perception, path planning, and motion control. This paper explores the feasibility of an end-to-end Deep Reinforcement Learning (DRL) approach that maps raw sensor data directly to thruster commands, reducing manual engineering. We propose a hierarchical reinforcement learning (HRL) architecture splitting the problem into two Markov Decision Processes. A High-Level (HL) policy operating at 2Hz processes raw $84 \times 84$ pixel monocular camera frames, stacked $100 \times 100$ pixel forward-looking imaging sonar, and proprioceptive data to generate spatial subgoals. Simultaneously, a Low-Level (LL) policy operating at 10Hz converts these subgoals into thruster commands. The HL policy is trained using Reinforcement Learning from Prior Demonstrations (RLPD) within a modified Sample-Efficient Robotic Reinforcement Learning (SERL) framework, while the LL policy utilizes Soft Actor-Critic (SAC) combined with Hindsight Experience Replay (HER). Evaluated in the high-fidelity HoloOcean simulator, our method demonstrates successful obstacle avoidance, achieving trajectory lengths closely approximating (within 4% to 6% of) an $\text{RRT}^*$ planning baseline. Furthermore, the learned policy exhibits strong robustness to simulated sensor noise and decreased visibility. While the system navigates familiar geometries effectively, experiments reveal generalization limitations when encountering unvisited areas with novel obstacle shapes. Ultimately, this work demonstrates the promise of sample-efficient, end-to-end DRL for underwater navigation using minimal computational hardware.
ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies
Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task, or commit to unrecoverable actions. Existing online failure detectors either require white-box access to policy internals or add runtime overhead through resampling and observation-side signals. Our empirical analysis shows that emitted action chunks themselves already carry strong predictive signal for impending failures in generative robot policies. Motivated by this observation, we introduce ActProbe, a lightweight, pure action-space detector that uses two compact signals available from a single forward pass: Temporal Consistency Error (TCE) between consecutive action chunks and Action Chunk Magnitude (ACM) of the current chunk. ActProbe maps these signals to per-step failure probabilities with a task-conditioned LSTM-MLP architecture. Across a diverse suite of generative robot policies and benchmarks, ActProbe raises alerts before failures become visually recognizable, improving the accuracy (F1)-timeliness Pareto frontier of failure detection by an average hypervolume gain of +12.7% over both internal- and external-feature baselines, with a +9.0% early-detection ROC-AUC lead on unseen tasks. ActProbe further transfers to deployment, predicting failures on unseen real-robot pick tasks and accelerating RL fine-tuning (PPO) with 2.9x fewer environment interactions.
comment: 24 pages,9 figures,11 tables, Project page: https://air-embodied-brain.github.io/actprobe
EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control
Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Motion tracking reproduces specified trajectories, and humanoid vision-language-action systems provide semantic interfaces, but neither offers a scalable and interactive prior for broad full-body behavior. We introduce EgoPriMo (Egocentric Motion Prior for Humanoid Robots), a unified framework that learns such priors from egocentric human demonstrations. Given egocentric observations and a text prompt, EgoPriMo reconstructs, generates, and forecasts SMPL-based full-body motion. Language is used as a high-level control signal rather than a complete motion specification. At the core of EgoPriMo is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text; task-conditioning masks route different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show that one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting; the generated SMPL motions can also be executed by a Unitree humanoid controller. These results indicate a practical path from scalable egocentric observations to generalizable and interactive humanoid motion priors.
LUNA-AD: Lightweight Uncertainty-Aware Language Model with Lifelong Learning for Autonomous Driving
While large language models (LLMs) offer promising reasoning capabilities, their integration into safety-critical driving systems is hindered by limited reasoning diversity, high computational overhead, and static learning paradigms. To address these challenges, we propose LUNA-AD, a lightweight uncertainty-aware language model with lifelong learning for autonomous driving (AD). LUNA-AD features a tri-system architecture that reconciles complex multimodal behavioral reasoning, efficient deployment, and continual refinement. We design a multi-agent analytical system to generate uncertainty-aware decision-making demonstrations through diverse hypothesis exploration. A dual-head lightweight heuristic model is distilled to unify the inference of decision distributions and textual explanations while enabling efficient deployment. Furthermore, a reflection-driven lifelong learning mechanism operates on multimodal decision outputs and preserves strategic diversity, allowing for the refinement of candidate decisions and rationales via closed-loop feedback to enhance driving robustness. Extensive experiments on nuPlan benchmarks demonstrate that LUNA-AD achieves state-of-the-art success rates under both non-reactive and reactive modes, with drastically reduced inference latency compared to existing knowledge-driven AD frameworks.
comment: 16 pages,9 figures
Personalized and Robust Proactive Robot Assistance with Uncertainty-Guided LLM Reasoning
Proactive robot assistance in household environments requires accurate prediction of human activities and object usage under dynamic and noisy conditions. Existing approaches often rely on complex spatio-temporal models, which can be computationally expensive and sensitive to environmental variability. In this paper, we propose GLOBE, a lightweight framework that combines n-gram Markov models for capturing temporal behavioral patterns with uncertainty-guided large language model (LLM) reasoning. The framework performs sequential prediction efficiently while selectively invoking LLM reasoning only when the model confidence is low. To evaluate performance under realistic conditions, we introduce HOMER-Noise, a noisy extension of the HOMER+ dataset that simulates structured disturbances such as object movements caused by humans, pets, and toddlers. Experimental results show that GLOBE achieves competitive performance with state-of-the-art methods while improving robustness and computational efficiency across both clean and noisy settings. The framework is further validated through a proof-of-concept integration with a Stretch 3 mobile manipulator, demonstrating its potential application in real-world human-robot interaction scenarios.
comment: Accepted to the 2026 IEEE 35th International Conference on Robot and Human Interactive Communication (RO-MAN)
GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors
Robotic grasping is a fundamental capability in robotic manipulation. Yet grasping remains challenging under partial observations. Reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware grasping methods recognize the value of reconstruction, but they typically treat geometry as an intermediate prediction rather than a reusable object prior for grasping. In this paper, we present GraspFoM, a unified framework that leverages 3D foundation priors (SAM3D) to build a shared 3D object latent for both reconstruction and grasp pose prediction. Built on this shared object latent, we introduce an anchor-initialized truncated pose-reasoning diffuser that predicts continuous and multimodal grasp poses without directly relying on discrete grasp candidates. We further investigate the interaction between reconstruction and grasping through a reconstruction-aware scorer and a residual latent updater. Reconstruction provides grounded geometric cues, while grasp supervision refines the shared object latent toward grasp-relevant affordances. GraspFoM jointly predicts grasp poses and reconstructs high-fidelity 3D assets in mesh and 3DGS forms. Comprehensive experiments demonstrate that GraspFoM achieves state-of-the-art results on both reconstruction and grasping. Notably, these improvements require only a small number of additional trainable parameters. Component-wise ablation studies also demonstrate the contribution of each component.
PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation
Diffusion policies have achieved remarkable success in robotic manipulation, yet they often fail to satisfy strict physical constraints required for safe deployment. Existing approaches impose safety either prematurely during training or reactively via external guardrails at test time, limiting policy expressivity and overall scalability. We propose Physical safety Alignment for Constrained Trajectories (PACT), a self-evolving post-training framework that projects pretrained diffusion policies onto constraint-feasible regions without accessing demonstration data or task rewards. PACT distills constraint gradients into the diffusion model through a reverse-KL objective with dense supervision across timesteps. It incorporates a curriculum that progressively tightens constraints while maintaining theoretically bounded policy shift and monotone improvement, mitigating the safety-performance trade-off from catastrophic forgetting. On simulated and real-world embodied manipulation benchmarks, PACT significantly reduces safety violations by 31.0% on average while improving task success by 30.7%.
SAD-Flower: Flow Matching for Safe, Admissible, and Dynamically Consistent Planning
Flow matching (FM) has shown promising results in data-driven planning. However, it inherently lacks formal guarantees for ensuring state and action constraints, whose satisfaction is a fundamental and crucial requirement for the safety and admissibility of planned trajectories on various systems. Moreover, existing FM planners do not ensure the dynamical consistency, which potentially renders trajectories inexecutable. We address these shortcomings by proposing SAD-Flower, a novel framework for generating Safe, Admissible, and Dynamically consistent trajectories. Our approach relies on an augmentation of the flow with a virtual control input. Thereby, principled guidance can be derived using techniques from nonlinear control theory, providing formal guarantees for state constraints, action constraints, and dynamic consistency. Crucially, SAD-Flower operates without retraining, enabling test-time satisfaction of unseen constraints. Through extensive experiments across several tasks, we demonstrate that SAD-Flower outperforms various generative-model-based baselines in ensuring constraint satisfaction.
6G Empowering Future Robotics: A Vision for Next-Generation Autonomous Systems
The convergence of robotics and next-generation communication is a critical driver of technological advancement. As the world transitions from 5G to 6G, the foundational capabilities of wireless networks are evolving to support increasingly complex and autonomous systems. We examine the transformative impact of 6G on enhancing key robotics functionalities. It provides a systematic mapping of IMT-2030 key performance indicators to robotic functional blocks, including sensing, perception, cognition, actuation, and self-learning. Building upon this mapping, we propose a high-level architectural framework integrating robotic, intelligent, and network service planes, underscoring the need for a holistic approach. As an example, use case, we present a real-time, dynamic safety framework enabled by IMT-2030 capabilities for safe and efficient human-robot collaboration in shared spaces.
comment: IEEE Communication Magazine
HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions
Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops of leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, human-aware navigation research.
comment: 35 pages, 20 figures, website: https://f1y1113.github.io/HA-VLN-webpage/
Decentralized End-to-End Multi-AAV Pursuit Using Predictive Spatio-Temporal Observation via Deep Reinforcement Learning
Decentralized cooperative pursuit in cluttered environments is challenging for autonomous aerial swarms, especially under partial and noisy perception. Existing methods often rely on abstracted geometric features or privileged ground-truth states, and therefore sidestep perceptual uncertainty in real-world settings. We propose a decentralized end-to-end multi-agent reinforcement learning (MARL) framework that maps raw LiDAR observations directly to continuous control commands. Central to the framework is the Predictive Spatio-Temporal Observation (PSTO), an egocentric grid representation that aligns obstacle geometry with predictive adversarial intent and teammate motion in a unified, fixed-resolution projection. Built on PSTO, a single decentralized policy enables agents to navigate static obstacles, intercept dynamic targets, and maintain cooperative encirclement. Simulations demonstrate that the proposed method achieves superior capture efficiency and competitive success rates compared to state-of-the-art learning-based approaches relying on privileged obstacle information. Furthermore, the unified policy scales seamlessly across different team sizes without retraining. Finally, fully autonomous outdoor experiments validate the framework on a quadrotor swarm relying on only onboard sensing and computing.
Relational Epipolar Graphs for Robust Relative Camera Pose Estimation
A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses using matched keypoints. Accurate estimation is challenged by noisy correspondences. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, where matched keypoints are nodes and nearby ones are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a quaternion rotation, translation vector, and the Essential Matrix (EM). Minimizing a loss comprising (i) $\mathcal{L}_2$ differences with ground truth (GT), (ii) Frobenius norm between estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences, yields the relative pose between image pairs. The dense detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.
comment: 21 pages, 11 figures, 11 Tables, Submitted to IJCV
Petri Net Modeling and Deadlock-Free Scheduling of Attachable Heterogeneous AGV Systems
The increasing demand for flexible automation has accelerated the adoption of heterogeneous automated guided vehicles (AGVs). This work investigates a new scheduling problem in a material transportation system consisting of attachable heterogeneous AGVs, including carriers and shuttles, that flexibly attach and detach for cooperative task execution. While such collaboration enhances operational efficiency, the attachment-induced synchronization renders the system highly coupled and susceptible to deadlocks. To address this, we propose a Petri net (PN)-based deadlock-free scheduling framework integrated into an adaptive large neighborhood search (ALNS) algorithm. The PN is introduced to map candidate solutions from static permutations into dynamic collaborative processes, enabling performance evaluation via state evolution and proactive deadlock prevention through structural analysis. Extensive experiments on real-world and synthetic instances demonstrate that the proposed framework significantly improves computational efficiency, with the developed ALNS outperforming the current on-site policy, exact solvers, and state-of-the-art metaheuristics. Finally, sensitivity analysis yields managerial insights for optimal fleet sizing.
comment: This work has been submitted to the IEEE for possible publication
CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models
Vision-Language-Action (VLA) models promise generalist robot manipulation, but are typically trained and deployed as short-horizon policies that assume the latest observation is sufficient for action reasoning. This assumption breaks in non-Markovian long-horizon tasks, where task-relevant evidence can be occluded or appear only earlier in the trajectory, and where clutter and distractors make fine-grained visual grounding brittle. We present CodeGraphVLP, a hierarchical framework that enables reliable long-horizon manipulation by combining a persistent semantic-graph state with an executable code-based planner and progress-guided visual-language prompting. The semantic-graph maintains task-relevant entities and relations under partial observability. The synthesized planner executes over this semantic-graph to perform efficient progress checks and outputs a subtask instruction together with subtask-relevant objects. We use these outputs to construct clutter-suppressed observations that focus the VLA executor on critical evidence. On real-world non-Markovian tasks, CodeGraphVLP improves task completion over strong VLA baselines and history-enabled variants while substantially lowering planning latency compared to VLM-in-the-loop planning. We also conduct extensive ablation studies to confirm the contributions of each component.
TACO: General Acrobatic Flight Control via Target-and-Command-Oriented Reinforcement Learning
Although acrobatic flight control has been studied extensively, one key limitation of the existing methods is that they are usually restricted to specific maneuver tasks and cannot change flight pattern parameters online. In this work, we propose a target-and-command-oriented reinforcement learning (TACO) framework, which can handle different maneuver tasks in a unified way and allows online parameter changes. Additionally, we propose a spectral normalization method with input-output rescaling to enhance the policy's temporal and spatial smoothness, independence, and symmetry, thereby overcoming the sim-to-real gap. We validate the TACO approach through extensive simulation and real-world experiments, demonstrating its capability to achieve high-speed circular flights and continuous multi-flips.
comment: For the experiment video, please refer to https://youtu.be/x1v7nD2iHIk
Transforming Police-Car Swerving for Mitigating Isolated Stop-and-Go Traffic Waves: A Practice-Oriented Jam-Absorption Driving Strategy
Stop-and-go traffic waves, a major form of freeway congestion, impose severe and persistent adverse impacts, including reduced traffic efficiency, increased safety risks, and elevated vehicle emissions. Among various freeway traffic management strategies, jam-absorption driving (JAD), in which a dedicated vehicle performs "slow-in" and "fast-out" maneuvers before being captured by a stop-and-go wave, has been proposed as a promising approach to suppressing the propagation of such waves. However, most existing JAD strategies remain impractical, primarily due to the lack of consideration of implementation vehicles and operational conditions. Inspired by real-world observations of police-car swerving behavior, this paper first introduces the Single-Vehicle Double-Detector Jam-Absorption Driving (SD-JAD) problem and then proposes a practical JAD strategy based on a definition of the JAD Triangle, transforming such behavior into a traffic control strategy capable of suppressing the propagation of an isolated stop-and-go wave. Five key parameters that significantly affect the proposed strategy, namely JAD speed, inflow traffic speed, wave width, wave speed, and in-wave speed, are identified and systematically analyzed. Using a SUMO-based simulation as an illustrative example, we further demonstrate how these parameters can be measured in practice using only two stationary roadside traffic detectors. The results show that the proposed JAD strategy successfully suppresses the propagation of a stop-and-go wave without triggering secondary waves. This paper is expected to take a significant step toward the practical implementation of JAD, advancing it from a theoretical concept to a feasible and deployable traffic management strategy.
An Interval Branch-and-Bound-Based Inverse Kinemetics Algorithm Towards Global Optimal Redundancy Resolution
The general inverse kinematics (IK) problem of a manipulator, namely that of acquiring the self-motion manifold (SMM) of all admissible joint angles for a desired end-effector pose, plays a vital role in robotics modeling, planning and control. To efficiently solve the generalized IK, this paper proposes an interval branch-and-bound-based approach, which is augmented with a fast numerical IK-solver-enabled search heuristics. In comparison to independent solutions generated by sampling based methods, our approach generates patches of neighboring solutions to provide richer information of the inherent geometry of the SMM for optimal planning and other applications. It can also be utilized in an anytime fashion to obtain solutions with sub-optimal resolution for applications within a limited period. The performance of our approach is verified by numerical experiments on both non-redundant and redundant manipulators.
Multiagent Systems
PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting
Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability to compose orchestration prompts for multi-agent systems. PerspectiveGap contains 110 scenarios, each evaluated through two distractor-mixed task formats: role-fragment assignment and free-form prompt writing. These scenarios are organized into 10 topologies, which are distilled from the authors' real-world engineering practice and framed by the Prompt Economy principle: building loop-centered orchestrations that maximize utility with minimal role and engineering overhead. In experiments with 27 commercial models from 10 companies, GPT-5.5 substantially outperforms all competitors, whereas Opus 4.7 shows a notable weakness in orchestration prompting despite its strong coding performance. Nevertheless, PerspectiveGap remains challenging: the evaluated models achieve an average combined pass rate of only 14.9\% (GPT-5.5 62.0\%) and an average overall leakage rate of 246.5\% (a per-scenario information leak-event count, not a proportion; GPT-5.5 49.1\%). These findings suggest that multi-agent orchestration prompting is a distinct and under-evaluated capability, and PerspectiveGap provides a foundation for measuring and improving it systematically.
RAILS: Verification-Native Clearing For Agentic Commerce
Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whether they met their delegated obligation, who is responsible when they did not, or which settlement action follows. This is the agentic clearing problem. Tool protocols (MCP), inter-agent communication (A2A), payment rails (x402), mandate and network agent protocols (AP2, Visa, Mastercard), and settlement-risk standards each assume that determination and none produce it. Clearing is the missing primitive. Payment is not clearing. Authorization is not clearing. LLM-as-judge evaluation is not clearing. Settlement-risk escrow is not clearing: it consumes clearing decisions. RAILS (Real-Time Agent Integrity & Ledger Settlement) is the integrity and clearing layer for agentic commerce, spanning a per-output reliability score, a published reliability record, and a clearing function that consumes them. The clearing protocol at its core closes that gap. Seven primitives (Obligation Object, Evidence Envelope, Verification Mesh, Clearing Decision, Settlement Instruction, Clearing Passport, Finality Rules), bound by a formal model of admissibility-graded verification, together yield a soundness property: no financially material settlement is supported by evidence below the obligation's admissibility floor. The property is falsifiable against the spec. We are not aware of a prior agent-commerce verification mechanism that states a property of this kind. The approaches nearest to it emit a pass, a delivery guarantee, a bare score, or an equilibrium. This paper specifies that clearing protocol.
comment: 49 pages, 15 figures
Is Telehealth Better Used to Treat Patients or Help Other Physicians Treat Patients? An Agent-Based Modeling Study of Healthcare Provision
Telehealth, the delivery of medical care remotely, is hoped to increase access to specialty services or decrease health care utilization. Physicians can provide telehealth to each other or to patients. Specialists often treat complex patients who can be adequately cared for only in academic hospitals, suggesting that providing specialty services via telehealth will reallocate rather than reduce system utilization. Here I use agent-based modeling to investigate telehealth's effects on clinical outcomes and system utilization in medical toxicology. I found that physician-physician telehealth increased patient health but system utilization did not change. The effects were more pronounced as clinical complexity increased. Physician-patient telehealth increased cost and system utilization but not clinical outcomes. Within the limitations of our approach, these results suggest that telehealth is more cost-effective for improving generalist access to specialist knowledge than in providing care to the public.
comment: Presented at HICSS 2022
Quantitative Promise Theory: Intentionality and Inference in Autonomous Agents
I discuss some quantitative representations of Promise Theory for processes involving autonomous agents. Agent models are common in software systems, machine learning, and biology, for example, but may also apply to physics and other forms of engineering. I describe how Bayesian probability and information theoretic optimization, including Active Inference, may be incorporated with promise semantics -- as well as how Promise Theory supplements solutions, helping to avoid probability's pitfalls, which include non-local coordination, calibrating, and normalizing probabilistic computations. The role of boundary conditions in constraining allowed states and selecting decision thresholds is a form of promise, and agent alignment provides a scalable definition of intent. Autonomous agents may congeal into swarms with superagent characteristics by trying to minimize their information, despite uncertainty that works to maximize it. The use of Promise Theory involves some research challenges as well as stylistic preferences.
The Consistency Illusion: How Multi-Agent Debate Hides Reasoning Misalignment
Multi-agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer-level consensus does not entail reasoning-level alignment. We introduce CARA (Cross-Agent Reasoning Alignment), a family of automated metrics that measure whether agents who agree on an answer also agree on the reasoning. Applying CARA to a standard debate system on two medical QA benchmarks, MedQA-USMLE and MedThink-Bench, we identify the consistency illusion: a failure mode where debate reduces detectable contradictions between agents while simultaneously decreasing the semantic similarity of their reasoning chains; agents appear to agree more but reason less consistently. To improve this misalignment, we propose the Grounded Debate Protocol (GDP), a prompt-level intervention that requires agents to commit to named medical facts and take explicit stances on other agents' claims. GDP produces large, consistent alignment improvements, with Cohen's d ranging from +1.43 to +1.99, across two datasets and two backbone models, without adding LLM calls or modifying system architecture. Our results motivate cross-agent reasoning alignment as a quantity to audit alongside accuracy in safety-critical domains.
SceneConductor: 3D Scene Generation from Single Image with Multi-Agent Orchestration
Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.
Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming
Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty. Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency 350x. We also release an air-ground dataset for future research. Code, videos, and dataset available at https://co-glance.github.io/ .
comment: Code, videos, and dataset available at https://co-glance.github.io/
GLP: A Grassroots, Multiagent, Concurrent, Logic Programming Language for AI (Full Version)
A grassroots platform is a multiagent distributed system in which multiple independent instances can form and operate independently of each other and of any global resource, yet may coalesce into ever larger instances, possibly resulting in a single global instance. Grassroots platforms aim to offer an egalitarian/democratic alternative to centralised/autocratic and decentralised/plutocratic global platforms. Here, we present Grassroots Logic Programs (GLP), a multiagent concurrent logic programming language designed for the implementation of grassroots platforms: we recall the standard operational semantics of logic programs; introduce the concurrent operational semantics of GLP as its restriction; recall multiagent atomic transactions; use them to introduce a multiagent operational semantics of GLP; and prove multiagent GLP to be grassroots. The grassroots social graph -- the foundational grassroots platform on which all others are based -- serves as a GLP programming example. These mathematical foundations are being used by AI to implement GLP as well as to program in GLP: a workstation-based implementation of concurrent GLP in Dart was derived from the concurrent operational semantics of GLP; a multiagent smartphone-based implementation of GLP in Dart/Flutter is being developed based on the multiagent operational semantics of GLP; a moded type system for GLP was designed (and implemented by AI in Dart) to facilitate collaborative human-AI development of GLP programs, where AI derives working GLP programs from human-approved type definitions and declarations; GLP implementations of grassroots platforms for the social graph, social networks, currencies and bonds, and more, have been derived by AI from mathematical specifications written as volitional multiagent atomic transactions.
Implementing Grassroots Logic Programs with Multiagent Transition Systems and AI (Full Version)
Grassroots Logic Programs (GLP) is a concurrent logic programming language in which logic variables are partitioned into paired readers and writers. An assignment is produced at most once via a writer and consumed at most once via its paired reader, and may contain additional readers and/or writers. This enables the concise expression of rich multidirectional communication modalities. The language was introduced together with concurrent (cGLP) and multiagent (maGLP) operational semantics. Here, we derive from these (\ia)~dGLP, a deterministic counterpart of cGLP, and (\ib)~madGLP, a counterpart of maGLP in which deterministic agents communicate solely by asynchronous message passing, and prove them correct against their abstract counterparts. maGLP shared variable pairs spanning agents can be implemented as local variables paired by \emph{global links}, with correctness following from disjoint substitution commutativity (a consequence of GLP's single-occurrence invariant). We further prove that madGLP is grassroots. Both dGLP and madGLP serve as formal specifications for an AI-driven implementation discipline (math $\to$ informal spec $\to$ Dart) employed and described here: from dGLP, AI (Claude) developed a workstation-based GLP implementation in Dart, and from madGLP it is developing a smartphone-based multiagent one.
MAR:Multi-Agent Reflexion Improves Reasoning Abilities in LLMs
LLMs have shown the capacity to improve their performance on reasoning tasks through reflecting on their mistakes, and acting with these reflections in mind. However, continual reflections of the same LLM onto itself exhibit degeneration of thought, where the LLM continues to repeat the same errors again and again even with the knowledge that its wrong. To address this problem, we instead introduce multi-agent with multi-persona debators as the method to generate reflections. Through out extensive experimentation, we've found that the leads to better diversity of in the reflections generated by the llm agent. We demonstrate an accuracy of 47% EM HotPot QA (question answering) and 82.7% on HumanEval (programming), both performances surpassing reflection with a single llm.
Moded Types for Grassroots Logic Programs, by AI, for AI (Full Version)
Grassroots Logic Programs (GLP) is a concurrent logic programming language in which logic variables are partitioned into paired readers and writers. An assignment is produced at most once via a writer and consumed at most once via its paired reader, and may contain additional readers and/or writers. This enables the concise expression of rich multidirectional communication modalities. ``Logic Programs as Types for Logic Programs'' (LICS'91) defined types as regular sets of paths over the Herbrand atom semantics of a logic program. Here, we develop a \emph{moded-atom semantics} that extends the standard Herbrand atom semantics in two ways: (\ia)~each atom subterm carries a \emph{mode}, recording whether it is consumed from or produced to the environment; and (\ib)~partial computations, including those that deadlock, fail, or never terminate, also contribute moded atoms to the semantics. We define types to be regular sets of \emph{moded paths} over this semantics, give a syntactic definition of GLP well-typing, and prove that a well-typed program is sound: every output path in its well-typed moded-atom semantics conforms to its declared output type. A type checker for GLP was implemented \emph{by} AI (Claude) in Dart, starting from the mathematical specification of Typed GLP (this paper), deriving from it an English+pseudocode spec (written by AI), and from the spec deriving Dart code (by AI). While GLP is naturally untyped, the motivation for typing it was \emph{for} AI: tasking AI to program complex communication modalities and hoping for the best turned out to be a tenuous strategy. The discipline we developed with Typed GLP is for the human designer and AI to jointly develop formal GLP type definitions and declarations, together with informal intent of the declared procedures, and only then let AI write the GLP code.
Unsupervised Partner Design Enables Robust Ad-hoc Teamwork
We introduce Unsupervised Partner Design (UPD), a population-free multi-agent reinforcement learning method for robust ad-hoc teamwork. UPD generates training partners on-the-fly and selects them adaptively based on a learnability criterion, removing the need for pre-trained partner populations or manual parameter tuning. We show that this simple mechanism enables effective partner diversity and can be extended to joint partner-environment selection when a procedural level generator is available. Across Level-Based Foraging, Overcooked-AI, and the Overcooked Generalisation Challenge, UPD consistently achieves strong performance compared to both population-based and population-free baselines. In a human-AI user study, agents trained with UPD achieve higher returns and are rated as more adaptive, more human-like, and less frustrating than all evaluated baseline methods.
comment: 27 pages
SatIR: Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching
Many important retrieval problems are not merely problems of semantic similarity, but problems of constraint satisfaction: a retrieved item should be topically relevant to a query and satisfy explicit requirements involving negation, temporal conditions, numeric thresholds, exceptions, ontological relations, and incomplete evidence. We study this challenge in clinical trial matching, a high-stakes test bed where a useful trial must both address a patient's medical needs and satisfy complex eligibility criteria. We propose SatIR, a scalable constraint-based retrieval method for clinical trial matching. SatIR converts trial eligibility criteria and summaries into formal constraints, then retrieves patient--trial pairs by executing these constraints over a database. The system combines Satisfiability Modulo Theories (SMT), relational algebra, medical ontology grounding, and large language models (LLMs): formal methods provide executable and inspectable matching, while LLMs convert ambiguous, incomplete, and implicit clinical information into explicit, controllable constraint representations. Across the SIGIR 2016 patient--trial collection and TREC-2022-RetrievalSubset, a benchmark derived from TREC 2022, SATIR consistently improves eligibility-aware retrieval over similarity-based baselines. Relative to TrialGPT-style retrieval, SATIR retrieves 32%--72% more relevant-and-eligible trials per patient on SIGIR 2016 and achieves $1.8$--$3.2\times$ higher eligible-trial recall on TREC-2022-RetrievalSubset. Retrieval is fast, requiring only 146 milliseconds per patient over 3,621 SIGIR trials.
Systems and Control (EESS)
Direct Data-driven Predictive Control: A Computationally Efficient Alternative to DeePC for Eco-driving in Mixed Traffic Flows
Improving energy efficiency in the transportation sector is critical for achieving sustainable mobility, with eco-driving emerging as a key strategy. However, implementing effective eco-driving for connected and automated vehicles (CAVs) in mixed traffic presents a significant control challenge due to the heterogeneous, uncertain behavior of human-driven vehicles (HDVs). Data-enabled Predictive Control (DeePC) offers a promising model-free approach but is often hindered by a high computational burden, limiting its real-time feasibility. This paper introduces a novel Direct Data-driven Predictive Control (D3PC) framework to address this limitation. By reformulating the data-driven prediction mechanism, the D3PC significantly reduces computational complexity, making its computation time nearly invariant to historical data size. This computational efficiency directly enables the formulation of a sophisticated eco-driving controller that can solve the complex energy optimization problem in real time, even within diverse and stochastic mixed-traffic environments. Comprehensive simulations demonstrate that the D3PC is orders of magnitude faster than existing DeePC-based methods while achieving superior energy efficiency. Specifically, it reduces total platoon energy consumption by up to 10.71% compared to rule-based cruise control baselines and 3.80% compared to the original DeePC, confirming its effectiveness for real-time, energy-efficient control.
comment: 6 pages
Optimal Control and Dissipativity of Linear Hermitian Matrix-Valued Dynamical Systems
We develop a unified framework for linear-cost optimal control, finite-time optimal steering, dissipativity analysis, and zero-sum differential games for linear impulsive systems whose state is a Hermitian matrix evolving in $\mathbb{H}^{n+m}_{\succeq0}$, a class that encompasses continuous- and discrete-time linear systems and switched systems as degenerate cases, and includes the second-order moment dynamics of linear (stochastic) hybrid systems. The entire theory rests on three tools: a single \emph{key identity} relating cost, trajectory, and a dual variable, an Extended Schur complement lemma, and a Schur inner-product decomposition, applied identically to the flow integral and to each jump. These yield structurally uniform sufficient and necessary conditions, dual linear matrix inequality (LMI) characterizations, and explicit optimal policies for every problem class, on both finite and infinite horizons under time-varying assumptions (without time invariance or periodicity), together with causal dwell-time policies for the problems that admit them.
comment: 81 pages
Adaptive Model Predictive Control of Nonlinear Generic Urban Air Mobility Using Linear Parameter-Varying Systems
This paper presents an adaptive model predictive control (MPC) framework for nonlinear urban air mobility (UAM) vehicles operating across the full flight envelope. The proposed approach leverages a linear parameter-varying (LPV) representation to update the predictive model online, enabling accurate capture of strongly nonlinear and time-varying dynamics associated with distributed electric propulsion (DEP) eVTOL aircraft. To systematically address the high-dimensional and coupled nature of MPC tuning, a multi-objective evolutionary optimization strategy based on NSGA-II is employed, incorporating proper normalization of states and control inputs to ensure balanced weighting and meaningful exploration of the design space. The resulting controller explicitly accounts for actuator constraints and enables reconfigurable control allocation for fault-tolerant operation. The framework is evaluated in nonlinear simulations using NASA's Generic Urban Air Mobility (GUAM) model and benchmarked against a robust servomechanism linear quadratic regulator (RSLQR). Results demonstrate that the proposed adaptive MPC achieves improved trajectory tracking and enhanced robustness under both nominal conditions and actuator degradation scenarios, including partial motor failure, while maintaining constraint satisfaction throughout all flight regimes.
comment: This paper was accepted for presentation at the Vertical Flight Society's 82nd Annual Forum & Technology Display, West Palm Beach, FL, USA, May 5-7, 2026
Energy Storage as a Multi-Use Asset: Applications Across the Power System
The energy transition in power systems requires flexible assets to offset renewable generation variability across multiple time scales, while supporting the integration of renewables and the electrification of demand without requiring costly grid reinforcement. Energy storage occupies a unique position among these assets: depending on the technology, it can provide short-duration grid services at high ramping rates, such as frequency regulation and voltage support, longer-duration functions such as intra-day peak shaving, or inter-seasonal energy buffering. This multi-service character, combined with the declining costs of energy storage technologies (most notably that of battery energy storage systems), is central to the economic viability of storage investments. The value of a given installation depends strongly on its grid connection point and intended use case: an asset-coupled battery serving a consumer or generation plant faces a different service landscape, and therefore a different business case, than a network-coupled system operating as an independent grid resource. This paper presents a structured taxonomy of grid-connected energy storage applications, discusses the principal application domains, and describes the key challenges that must be addressed to integrate storage effectively into power systems. Services are discussed with special emphasis on the Swiss regulatory context. Finally, the STORE flagship project supported by the Swiss Innovation Agency (Innosuisse), where some of the critical challenges of energy storage integration in power grids are addressed, is introduced.
comment: Manuscript submitted to the Grid Services & Markets Conference (GSM 2026)
Some Essential Constructive Foundations for Systems and Control
This work develops several constructive foundations for systems and control within Bishop-style constructive mathematics. For an engineer, the guiding principle is that an object claimed to exist, such as a trajectory, an optimal control law, a selector, or a viable solution, should come with finite data and an operation computing approximations to any prescribed precision. The style remains close to classical analysis, but existential statements are organized so that their computational content is visible. The paper begins with elementary geometric data in finite-dimensional Euclidean spaces: blocks, multiblocks, representable sets, regular functions, and certified integrals. This set-first integration route is meant to complement, rather than replace, abstract constructive integration theories such as Daniell-type or integration-space approaches. The developed apparatus is then applied to a constructive functional extremum-value theorem, selector extraction for multifunctions, Filippov-type and viable solutions of differential inclusions, regular probability densities, controlled Markov chains, and empirical density certificates. A short account of resolvent projectors and linear stability is included for completeness.
comment: 130 pages, 15 figures
Real-Time and Accurate Collision-Free Teleoperation via Differentiable Constraint-Based Trajectory Planning ICRA2026
In teleoperation, the human operator typically controls only the end-effector pose, which often leads to self-collisions of the manipulator and collisions with environmental obstacles, since joints and links are not controlled individually. A common strategy to mitigate this issue is to enhance the operator's input using optimal-control-based trajectory planning. As derivative-based solvers require differentiable constraints, existing approaches either approximate robots and obstacles with spheres, reducing geometric accuracy, or approximate derivatives, degrading convergence and increasing computation times. We address these limitations by adapting a recent formulation of differentiable collision-avoidance constraints, based on duality in convex optimization, to the teleoperation setting. The robot is approximated with capsules and the environment with polytopes. We compare the resulting trajectory planning method against state-of-the-art techniques in simulation with varying numbers of obstacles and evaluate it on a UR5e manipulator in a real-world teleoperation test. Results show that our approach achieves lower computation times while enabling more accurate obstacle modeling, leading to smoother and collision-free end-effector teleoperation.
comment: 8 pages, 4 figures, accepted at ICRA2026
Hybrid Neural Network and Conventional Controller Approach for Robust Control of Highly Unstable Systems: Application to Tilt-Rotor Control
Multirotors are widely used in applications ranging from surveillance to precision agriculture, yet conventional designs remain limited by their under-actuation. Tilt-rotor configurations overcome this limitation by enabling full actuation. This paper investigates neural-network-based control strategies for a fully actuated tilt-rotor system with four thrust-vectoring inputs. Our work is structured in two parts. First, we deliberately present a negative result by evaluating a direct input-output control approach. In this method, multilayer perceptrons (MLPs), long short-term memory (LSTM) networks, and transformer models are trained to map system states and their desired values directly to control signals. We show that this strategy fails to stabilize the system, highlighting the inherent difficulty of applying direct input-output learning to highly unstable plants. Second, as the main contribution, we propose a neural-network-enhanced sliding mode controller (SMC). The method decomposes the system dynamics into input-independent and input-dependent components, with the former learned from a small dataset using lightweight networks, thereby reducing real-time computational demands. Moreover, the proposed method can be trained using flight logs collected from low-performance controllers, and the resulting dynamic model learned from real-world data can be used in simulation. We further compare MLP- and LSTM-based implementations under model uncertainties and external disturbances, demonstrating the robustness and effectiveness of the proposed approach; in particular, the controller with the LSTM plant dynamics predictor achieves superior performance to its MLP-based counterpart while also exhibiting lower runtime.
comment: Proceedings of the 13th RSI International Conference on Robotics and Mechatronics (ICRoM 2025)
Nonlocal Teams and Information Structures
We look at Bell inequalities from the lens of information structures in stochastic teams. We consider the usual CHSH game and a dynamic variant of the same to study how various classes of strategies, classical, projective and quantum, behave under team theoretic solution concepts. We find that projective strategies (where each player performs projective measurements) enjoy important properties in the usual CHSH game, but they do not carry over to its dynamic version. These results shed light on the delicate interplay of information structure in quantum strategies and the fragility of some well known ideas under changes of information structure.
comment: 29 pages, 3 figures
Cooperative Guidance and Control for Active Asset Protection with Time-Varying Agent Speeds
Protecting an asset against threats is a challenging problem in an era of continuously evolving intelligent attacks. This requires cooperation between the asset and the defender to share information and jointly maneuver. To address this problem, this work proposes a cooperative guidance and control strategy for active asset protection against a maneuvering threat. This work develops a joint maneuver strategy where both the defender and the asset coordinate their time-varying speeds and courses to neutralize/capture the attacker. The control strategy is formulated around three coupled geometric and temporal objectives. The first objective is to set the line-of-sight rate between the asset and the attacker to zero, putting the attacker on a collision course and reducing their maneuvering. The second objective is to maintain the defender on the line-of-sight between the asset and the attacker. This ensures that the attacker faces the defender first before reaching the vicinity of the asset. Lastly, the defender is also guided to pursue the attacker based on the time-to-go estimates between the defender and the attacker. While keeping these objectives in mind, the control actions for the asset and the defender are jointly designed, fostering cooperation between the two. The stability of the proposed strategy is established using a Lyapunov-based approach. Numerical simulations performed show the effectiveness of the proposed cooperative strategy in ensuring the successful capture of a maneuvering threat.
Coil-Integrated Alignment Sensor for Real-Time Feedback of Coil-Scalp Contact Point and Angle During Transcranial Magnetic Stimulation (TMS)
Whereas coil positioning in transcranial magnetic stimulation (TMS) to reach a specific cortical target with modern focal stimulation coils has been intensively studied, the alignment and contact of a coil with the head is often ignored. Focal figure-of-eight coils have a point on the surface, where they generate the largest induced electric field. This point should touch the head first, and the coil should be approximately tangential to the head in this point. Previous research has demonstrated the large impact if the coil does not touch the head with the right point and that many operators struggle with establishing or maintaining the correct coil-scalp alignment. This paper presents a technological support technology that can monitor the exact position of the contact point and also pressure to provide feedback to users. As the system uses exclusively components from consumer electronics, the sensor is low-cost and affordable. Through proper design, we achieved sufficient robustness so that the sensor does neither reset during TMS pulses and also not show any detectable degradation.
comment: 7 pages, 1 figure
Bayesian Optimization of a Multi-Product Chemical Reactor Using Composite Models and Partial Physics Knowledge
We study data-driven real-time economic optimization of a multi-product chemical reactor when no reliable first-principles model is available beyond a steady-state energy balance. Instead of learning the economic objective directly as a black-box function, we use a composite formulation in which Gaussian process (GP) models predict physically meaningful outputs, including product concentrations and reactor temperature, while profit is computed analytically from these predictions together with raw-material, product, and utility prices. This preserves the structure of the economic objective, makes it parametric in changing prices without needing retraining, and allows candidate operating points to be checked against the available energy balance through a physics residual. The GPs also provide predictive uncertainty, which is exploited in a Bayesian optimization (BO) framework both for data-efficient exploration and for conservative enforcement of the reactor temperature constraint through an upper confidence bound. The acquisition function additionally penalizes large energy-balance mismatch obtained by substituting the GP-predicted outputs and candidate inputs into the available steady-state energy balance. The approach is demonstrated on a benchmark simulation of a non-isothermal multi-product reactor. Relative to a trust-region safe BO implementation, the proposed method achieves better simulated economic performance within the available iteration budget. Relative to a purely data-driven BO approach that does not use the available physics information, it avoids reactor temperature constraint violations.
comment: Accepted to IFAC 2026. 11 pages, 4 figures
Towards End to End Motion Planning and Execution for Autonomous Underwater Vehicles Using Reinforcement Learning
Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for perception, path planning, and motion control. This paper explores the feasibility of an end-to-end Deep Reinforcement Learning (DRL) approach that maps raw sensor data directly to thruster commands, reducing manual engineering. We propose a hierarchical reinforcement learning (HRL) architecture splitting the problem into two Markov Decision Processes. A High-Level (HL) policy operating at 2Hz processes raw $84 \times 84$ pixel monocular camera frames, stacked $100 \times 100$ pixel forward-looking imaging sonar, and proprioceptive data to generate spatial subgoals. Simultaneously, a Low-Level (LL) policy operating at 10Hz converts these subgoals into thruster commands. The HL policy is trained using Reinforcement Learning from Prior Demonstrations (RLPD) within a modified Sample-Efficient Robotic Reinforcement Learning (SERL) framework, while the LL policy utilizes Soft Actor-Critic (SAC) combined with Hindsight Experience Replay (HER). Evaluated in the high-fidelity HoloOcean simulator, our method demonstrates successful obstacle avoidance, achieving trajectory lengths closely approximating (within 4% to 6% of) an $\text{RRT}^*$ planning baseline. Furthermore, the learned policy exhibits strong robustness to simulated sensor noise and decreased visibility. While the system navigates familiar geometries effectively, experiments reveal generalization limitations when encountering unvisited areas with novel obstacle shapes. Ultimately, this work demonstrates the promise of sample-efficient, end-to-end DRL for underwater navigation using minimal computational hardware.
RadioDiff-Inv2: Differentiable Diffusion Inversion under Location Drift from Sparse Noisy Measurements for Radio Map Estimation
Radio map (RM) estimation is a key enabler for environment-aware optimization in 6G wireless networks. In practice, RM construction increasingly relies on crowdsourced received signal strength (RSS) feedback that is inherently sparse and noisy. A further and often overlooked challenge is location drift, whereby privacy constraints and user mobility cause reported sampling coordinates to deviate from the true measurement locations. Unlike additive measurement noise, location drift perturbs the sensing operator itself, since each RSS sample effectively queries the underlying RM at an incorrect spatial coordinate. This operator uncertainty, compounded with sparse noisy sensing, renders the inverse problem severely ill-posed and limits conventional estimators that rely on analytically specified priors. This paper proposes RadioDiff-Inv2, a differentiable diffusion inversion framework that estimates RMs from sparse noisy measurements under location drift. A Gaussian resampling scheme is introduced to construct a differentiable, drift-aware measurement operator on grid-based maps, and the probability-flow ordinary differential equation (ODE) is exploited to cast the diffusion sampler as a deterministic, differentiable mapping from an initial noise code to the estimated RM. By optimizing the noise code via backpropagation against a drift-marginalized data-fidelity objective, RadioDiff-Inv2 produces reconstructions that are both prior-plausible and measurement-consistent without costly posterior sampling. Extensive experiments show that RadioDiff-Inv2 outperforms the best competing baseline by 4 to 14 dB in PSNR across varying sparsity and drift levels. The advantage is most pronounced in low-SNR regimes, where the learned diffusion prior maintains near-constant reconstruction fidelity while conventional methods degrade severely.
A Unified Framework for Contraction Stability Analysis of Heterogeneous Grid-Forming Inverters
The shift to renewable-dominated power systems has produced low-inertia grids, undermining system stability. In this context, grid-forming inverters (GFMs) have emerged as a promising solution. However, GFMs challenge conventional analysis techniques, especially those relying on small-signal or root-mean-square (RMS) models. Such models rely on linearization and sinusoidal steady-state assumptions, which fail in large-signal cases. Stability of GFM-based systems therefore becomes operating-point dependent, and a feasible operating point may not even exist. While large-signal analyses are available, decentralized certification of operating-point convergence with explicit transient guarantees, such as rate and overshoot, remains rare. This paper proposes an algebraic, decentralized contraction-based framework. The proposed contraction stability analysis certifies system stability and convergence to desired operating points. The method works in the time domain and captures nonlinear, large-signal behavior of synchronization and power-sharing mechanisms. Moreover, the contraction rate provides an explicit bound on transient time: trajectories converge exponentially to the new operating point at a controlled rate, yielding computable contraction regions that certify stability and large-signal convergence across operating-point changes. These regions directly guide parameter tuning for heterogeneous GFMs.
comment: 5 pages, 4 figures
Control-Theoretic View of Neural ODEs: Empirical Controllability and Observability
This paper studies neural ordinary differential equations (neural ODEs) from a control-theoretic perspective using controllability and observability concepts. The neural ODE is represented in a control-affine form to facilitate analysis using tools from nonlinear and linear time-varying (LTV) systems. Controllability is examined through trajectory linearization, where the LTV controllability Gramian provides a local, first-order measure of input influence along a nominal trajectory. Observability is analyzed through output linearization, where the LTV observability Gramian characterizes the local ability to reconstruct system states from output measurements. Koopman-based lifting is considered to extend the analysis to a higher-dimensional representation, and its limitations under multiple equilibria and basin-dependent behavior are discussed. The proposed framework is illustrated on a series RLC circuit. The learned neural ODE reproduces system trajectories and generalizes to unseen initial conditions. The computed Gramians are numerically full rank along the tested trajectories, indicating local controllability and observability of the linearized dynamics.
A Switching Beamformer for Highly Non-Stationary Environments
Adaptive beamforming is a cornerstone of array signal processing, yet its performance often collapses in the face of complex, rapidly changing interference. When interferers appear or move unpredictably, conventional estimators encounter a fundamental memory trade-off: short windows enable rapid tracking but suffer from high estimation variance, while long windows provide stable rejection but fail to adapt to shifts. This challenge is resolved by introducing the Universal Switching Beamformer (USB), which integrates competitive sequential prediction into the beamforming architecture. By employing a linear transition diagram, the USB implicitly maintains an exponentially large family of candidate covariance histories and dynamically re-weights them based on their cumulative output power. This mechanism allows the beamformer to automatically vary its effective memory length without explicit change detection or heuristic parameter tuning. A theoretical upper bound is proven on the regret relative to an omniscient oracle that selects the best piecewise-stationary covariance model in hindsight. Extensive simulations and experiments on the SwellEx-96 dataset demonstrate that the USB achieves the agility of short-window estimators and the precision of long-term integration, providing a principled solution for tracking highly non-stationary scenes.
comment: 11 pages, 19 figures, under review
SAD-Flower: Flow Matching for Safe, Admissible, and Dynamically Consistent Planning
Flow matching (FM) has shown promising results in data-driven planning. However, it inherently lacks formal guarantees for ensuring state and action constraints, whose satisfaction is a fundamental and crucial requirement for the safety and admissibility of planned trajectories on various systems. Moreover, existing FM planners do not ensure the dynamical consistency, which potentially renders trajectories inexecutable. We address these shortcomings by proposing SAD-Flower, a novel framework for generating Safe, Admissible, and Dynamically consistent trajectories. Our approach relies on an augmentation of the flow with a virtual control input. Thereby, principled guidance can be derived using techniques from nonlinear control theory, providing formal guarantees for state constraints, action constraints, and dynamic consistency. Crucially, SAD-Flower operates without retraining, enabling test-time satisfaction of unseen constraints. Through extensive experiments across several tasks, we demonstrate that SAD-Flower outperforms various generative-model-based baselines in ensuring constraint satisfaction.
Optimal Kron-based Reduction of Networks (Opti-KRON) for Three-phase Distribution Feeders
This paper presents a novel structure-preserving, Kron-based reduction framework for unbalanced distribution feeders. The method aggregates electrically similar nodes within a mixed-integer optimization (MIP) problem to produce reduced networks that optimally reproduce the voltage profiles of the original full network. To overcome computational bottlenecks of MIP formulations, we propose an exhaustive-search formulation to identify optimal aggregation decisions while enforcing voltage margin limits. The proposed exhaustive network reduction algorithm is parallelizable on GPUs, which enables scalable network reduction. The resulting reduced networks approximate the full system's voltage profiles with low errors and are suitable for steady-state analysis and optimal power flow studies. The framework is validated on two real utility distribution feeders with 5,991 and 8,381 nodes. The reduced models achieve up to 90% and 80% network reduction, respectively, while the maximum voltage-magnitude error remains below 0.003 p.u. Furthermore, on a 1000-node version of the network, the GPU-accelerated reduction algorithm runs up to 15x faster than its CPU-based counterpart.
A Systematic Comparison and Evaluation of Building Ontologies for Deploying Data-Driven Analytics in Smart Buildings
Ontologies play a critical role in data exchange, information integration, and knowledge sharing across diverse smart building applications. Yet, semantic differences between the prevailing building ontologies hamper their purpose of bringing data interoperability and restrict the ability to reuse building ontologies in real-world applications. In this paper, we propose and adopt a framework to conduct a systematic comparison and evaluation of four popular building ontologies (Brick Schema, RealEstateCore, Project Haystack and Google's Digital Buildings) from both axiomatic design and assertions in a use case, namely the Terminological Box (TBox) evaluation and the Assertion Box (ABox) evaluation. In the TBox evaluation, we use the SQuaRE-based Ontology Quality Evaluation (OQuaRE) Framework and concede that Project Haystack and Brick Schema are more compact with respect to the ontology axiomatic design. In the ABox evaluation, we apply an empirical study with sample building data that suggests that Brick Schema and RealEstateCore have greater completeness and expressiveness in capturing the main concepts and relations within the building domain. The results implicitly indicate that there is no universal building ontology for integrating Linked Building Data (LBD). We discuss ontology compatibility and investigate building ontology design patterns (ODPs) to support ontology matching, alignment, and harmonisation.
comment: 32 pages
Petri Net Modeling and Deadlock-Free Scheduling of Attachable Heterogeneous AGV Systems
The increasing demand for flexible automation has accelerated the adoption of heterogeneous automated guided vehicles (AGVs). This work investigates a new scheduling problem in a material transportation system consisting of attachable heterogeneous AGVs, including carriers and shuttles, that flexibly attach and detach for cooperative task execution. While such collaboration enhances operational efficiency, the attachment-induced synchronization renders the system highly coupled and susceptible to deadlocks. To address this, we propose a Petri net (PN)-based deadlock-free scheduling framework integrated into an adaptive large neighborhood search (ALNS) algorithm. The PN is introduced to map candidate solutions from static permutations into dynamic collaborative processes, enabling performance evaluation via state evolution and proactive deadlock prevention through structural analysis. Extensive experiments on real-world and synthetic instances demonstrate that the proposed framework significantly improves computational efficiency, with the developed ALNS outperforming the current on-site policy, exact solvers, and state-of-the-art metaheuristics. Finally, sensitivity analysis yields managerial insights for optimal fleet sizing.
comment: This work has been submitted to the IEEE for possible publication
CaFTRA: Frequency-Domain Correlation-Aware Feedback-Free MIMO Transmission and Resource Allocation for 6G and Beyond
The fundamental designs of wireless systems toward AI-Native 6G and beyond are driven by the need for ever-increasing demand of mobile data traffic, extreme spectral efficiency, and adaptability across diverse service scenarios. To overcome the limitations posed by feedback-based multiple-input and multiple-output (MIMO) transmission, we propose a novel frequency-domain Correlation-aware Feedback-free MIMO Transmission and Resource Allocation (CaFTRA) framework tailored for fully-decoupled radio access networks (FD-RAN) to meet the emerging requirements of AI-Native 6G and beyond. By leveraging artificial intelligence (AI), CaFTRA effectively eliminates real-time uplink feedback by predicting channel state information (CSI) based solely on user geolocation. We introduce a Learnable Queries-driven Transformer Network for CSI mapping from user geolocation, which utilizes multi-head attention and learnable query embeddings to accurately capture frequency-domain correlations among resource blocks (RBs), thereby significantly improving the precision of CSI prediction. Once base stations (BSs) adopt feedback-free transmission, their downlink transmission coverage can be significantly expanded due to the elimination of frequent uplink feedback. To enable efficient resource scheduling under such extensive-coverage scenarios, we apply a low-complexity many-to-one matching theory-based algorithm for efficient multi-BS association and multi-RB resource allocation, which is proven to converge to a stable matching within limited iterations. Simulation results demonstrate that CaFTRA achieves stable matching convergence and significant gains in spectral efficiency and user fairness compared to 5G, underscoring its potential value for 6G standardization efforts.
comment: 17 pages, 19 figures. Accepted by IEEE Transactions on Mobile Computing
Solution Sets for Inverse Infinite-Horizon Linear-Quadratic Descriptor Differential Games
In this letter, we study a model-based inverse problem for infinite-horizon linear-quadratic differential games with descriptor dynamics. Given an observed feedback strategy profile, we seek to identify all cost functions that rationalize it as a feedback Nash equilibrium; this collection is referred to as the solution set. We characterize the solution set, show that it is rectangular and convex, and provide an algorithm for computing an admissible realization whenever it is nonempty. We also show that, compared with the corresponding inverse problem for standard state-space dynamics, descriptor dynamics modify the geometry of the solution set and may reduce identifiability. Finally, we illustrate the results with numerical examples.
Subspace Pruning via Principal Vectors for Accurate Koopman-Based Approximations
The accuracy of Koopman operator approximations over finite-dimensional spaces relies critically on their invariance properties. These can be rigorously quantified via the principal angles between a candidate subspace and its image under the Koopman operator. This paper proposes a unified algebraic framework for subspace pruning designed to systematically refine the invariance error. We establish the geometric equivalence between consistency-based methods and principal-vector pruning, and build on this insight to introduce a hybrid strategy that balances between multiple and single principal vector pruning for improved numerical stability and scalability. We derive error bounds for the retention of approximate and external eigenfunctions, demonstrating that the multi-vector approach mitigates the numerical drift inherent to sequential pruning. To ensure scalability, we develop an efficient numerical update scheme based on rank-one modifications that reduces the computational complexity of tracking principal angles by an order of magnitude. Finally, we exploit the subspace obtained from the pruning algorithms to build a lifted linear model for state prediction that accounts for the trade-offs between improving invariance and minimizing state reconstruction error. Simulations demonstrate the effectiveness of our approach.
comment: arXiv admin note: substantial text overlap with arXiv:2603.29001
Koopman Subspace Pruning in Reproducing Kernel Hilbert Spaces via Principal Vectors
Data-driven approximations of the infinite-dimensional Koopman operator rely on finite-dimensional projections, where the predictive accuracy of the resulting models hinges heavily on the invariance of the chosen subspace. Subspace pruning systematically discards geometrically misaligned directions to enhance this invariance proximity, which formally corresponds to the largest principal angle between the subspace and its image under the operator. Yet, existing techniques are largely restricted to Euclidean settings. To bridge this gap, this paper presents an approach for computing principal angles and vectors to enable Koopman subspace pruning within a Reproducing Kernel Hilbert Space (RKHS) geometry. We first outline an exact computational routine, which is subsequently scaled for large datasets using randomized Nystrom approximations. Based on these foundations, we introduce the Kernel-SPV and Approximate Kernel-SPV algorithms for targeted subspace refinement via principal vectors. Simulation results validate our approach.
A Unified Algebraic Framework for Subspace Pruning in Koopman Operator Approximation via Principal Vectors
Finite-dimensional approximations of the Koopman operator rely critically on identifying nearly invariant subspaces. This invariance proximity can be rigorously quantified via the principal angles between a candidate subspace and its image under the operator. To systematically minimize this error, we propose an algebraic framework for subspace pruning utilizing principal vectors. We establish the equivalence of this approach to existing consistency-based methods while providing a foundation for broader generalizations. To ensure scalability, we introduce an efficient numerical update scheme based on rank-one modifications, reducing the computational complexity of tracking principal angles by an order of magnitude. Finally, we demonstrate the effectiveness of our framework through numerical simulations.
Robotics
Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation
In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enabling timely and reliable robotic assistance during long-horizon manipulation and assembly tasks. These systems require continuous understanding of user behavior to recognize actions, anticipate intentions, and detect mistakes in real time. However, robot teleoperation demonstrations are costly and hardware-limited, whereas human demonstrations are easier to collect and provide rich temporal structure. To address this challenge, we propose an uncertainty-aware human-to-robot intention prediction framework that combines: (1) hierarchical transfer learning, where MS-TCN++ is pretrained on human hand demonstrations and fine-tuned on limited robot teleoperation data to capture low-level actions and high-level task intentions; (2) a conformal prediction module that provides frame-level prediction sets with statistical coverage guarantees for reliable uncertainty quantification and early intention estimation; and (3) VLM-guided segment correction, which selectively reviews low-confidence or temporally uncertain segments using visual and temporal context. The framework supports action recognition, temporal segmentation, intention anticipation, and mistake detection for assisted teleoperation. Experiments on robot assembly demonstrations with 22 action classes show that human-to-robot fine-tuning improves the robot test-set Edit score from 70.50 to 80.70 using only 16 robot demonstrations. Edit-safe VLM correction further improves frame accuracy from 45.21% to 46.42% and increases F1@25 and F1@50 while preserving the Edit score. These results show that human demonstrations provide scalable pretraining data for robust, uncertainty-aware robot action segmentation. Code and data: project website.
comment: 7 pages, 6 figures. Preprint version
MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model
Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to resolve ambiguity in long-horizon manipulation. However, more spatiotemporal evidence is not necessarily better: when the injected evidence is not motion-consistent, it can introduce geometric drift, fragmented temporal cues, and unstable action generation. This raises a simple question: should a VLA remember past frames, or remember the motion that connects them? We introduce MotionVLA, a motion-history interface that converts a short past-only video window into compact, time-continuous trajectory-field tokens. Instead of treating history as a sparse set of ndependently lifted frames, MotionVLA represents recent observations as physically coherent motion evidence. Current visual tokens query this history to retrieve task-relevant motion information, which is then recoupled into the VLA stream under trajectory-grounded supervision. Experiments across simulation benchmarks and preliminary real-robot rollouts show that MotionVLA improves long-horizon manipulation while producing smoother and more direct executions. These results suggest that effective VLA memory is not just about providing more 4D context, but about exposing motion-consistent evidence that is usable for control.
comment: 17 pages, 8 figures
G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation
Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We introduce \ours{}, which keeps the foundation model entirely frozen and adds three lightweight trainable modules to bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head. The trainable footprint totals about 32M parameters, under 6\% of the full model, and is supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, \ours{} attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision. Code is available at https://github.com/WeiYuFei0217/G2G.
Impedance MPC for Physical Human-Robot Interaction: Predictive Disturbance Rejection with Joint-Limit Safety
Physical human-robot interaction (pHRI) demands simultaneous trajectory accuracy and compliant safety under unplanned contact. Classical impedance control incurs a nonzero steady-state position error under sustained human force -- the applied force divided by the task stiffness -- which integral action reduces only within a narrow stable-gain budget. We present a two-layer Impedance MPC that resolves this tension. Layer~1 analytically cancels gravity, Coriolis, and task-space inertia, reducing the residual plant to a configuration-independent double integrator with a constant state-transition matrix. Layer~2 solves a 30-variable convex QP at 100\,Hz, exploiting this constant structure so the free-response matrix is precomputed once; an augmented Kalman filter estimates the persistent disturbance state, giving a formal zero-steady-state-error guarantee. A null-space inverse-barrier potential and a task-space workspace projection enforce joint-limit safety across the tested workspace. On a 7-DOF Franka FR3, Impedance MPC with Kalman augmentation attains sub-0.05\,mm steady-state error versus 44.8\,mm for classical impedance (a $>$800-fold reduction) under a sustained 15\,N force, sub-millimeter tracking on four 3-D circles, and graceful robustness to measurement noise and inertial mismatch up to 30\%.
comment: 7 pages and 3 figures
SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation
Humanoid foundation models are advancing faster than we can evaluate them. While real-world testing is expensive and difficult to reproduce, existing simulation benchmarks focus primarily on table-top or wheeled robots. A scalable and reproducible benchmark for whole-body humanoid loco-manipulation remains an open problem. To this end, we present SIMPLE, a unified simulation testbed for humanoid policy learning and evaluation. SIMPLE couples the accurate contact-rich dynamics of MuJoCo with the photorealistic rendering of IsaacSim. It provides a large-scale environment comprising 60 diverse whole-body tasks, 50 indoor scenes, and over 1,000 object assets. To facilitate scalable data collection, the framework integrates two data generation pipelines: automated trajectory generation via motion planning and a low-latency VR teleoperation interface. We further integrate and benchmark mainstream humanoid policies at scale in SIMPLE, including lightweight imitation networks, large vision-language-action (VLA) models, and recent world action models (WAMs). Our experiments reveal a strong correlation between policy performance in simulation and the real world. Furthermore, we demonstrate that policies trained on data collected in SIMPLE can be transferred zero-shot to physical humanoid robots under similar settings, providing a robust and reproducible foundation for humanoid robotics research.
Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking
Enabling humanoid robots to operate in complex, dynamic environments remains a critical challenge, fundamentally limited by the ability to navigate robustly, safely, and accurately. While reinforcement learning with velocity-commanded policies has achieved remarkable robustness in humanoid locomotion, this approach lacks explicit control of the foothold placement, leading to unsafe behavior, such as stepping onto human feet, or imprecise navigation, hindering the following manipulation task. Conversely, explicit foothold-tracking policies offer a promising alternative by directly being commanded with target foot poses. However, existing approaches are often limited by unrealistic state assumptions, compromising real-world deployment, or they are part of staged pipelines, making them tied to specific downstream tasks. In this work, we introduce a novel, lightweight framework for training general-purpose 3D foothold-tracking policies. By dynamically providing footstep support through a goal sampler, this method enables the learned policy to be agnostic to specific terrains. Our new target representation effectively mitigates challenges arising in the real world, such as noisy and inaccurate pose estimation and foot contact estimation. Designed for direct real-world transfer, our policy acts as a standalone low-level controller that can be seamlessly paired with various high-level foothold generators. We demonstrate the effectiveness of our framework through extensive experiments in simulation and in the real world. By coupling our policy with different upstream planners, we achieve natural and accurate locomotion in challenging settings, paving the way for loco-manipulation tasks in complex environments.
comment: Accepted to RSS 2026
Disturbance-Aware Aerial Robotics for Ethical Wildlife Monitoring
Reliable wildlife monitoring is essential for ecology and conservation, yet many existing methods, such as tagging, capture, and close-range observation, can alter the very behaviors they aim to measure. Aerial robots offer a scalable alternative, which has shown promising performance in multiple studies. Nonetheless, existing approaches typically lack behavioral awareness, rely on fixed heuristics, or require real-world training data that are costly, impractical, and ethically difficult to obtain. As a result, there remains no general framework for adaptive drone-based monitoring that can both preserve ecological validity and scale across species, behaviors, and robotic platforms. In this study, we introduce a disturbance-aware reinforcement-learning-based framework for heterogeneous aerial robotic fleets that enables autonomous wildlife tracking while explicitly minimizing behavioral disruption. We couple a zoologically grounded simulation environment with fitted animal movement models derived from real trajectory statistics, and train control policies using a reward formulation that captures the trade-off between observation quality and disturbance risk. Across three species (pigeon, jackal, and spur-winged lapwing) with distinct ecologies and motion patterns and four increasingly strategic behavior models common in nature, the learned policies consistently surpassed currently used rule-based baselines and generalized across monitoring tasks, animal dynamics, and drone types. These results establish disturbance-aware learning as a viable foundation for non-invasive autonomous wildlife observation, opening a path towards scalable, ethically responsible, and scientifically reliable robotic monitoring in ecology and conservation.
Agentic Neuro-Symbolic Planning and Commissioning for Human-in-the-Loop Industrial Robotics with Digital Twins
Flexible robotic automation requires systems that interpret operator intent, verify physical feasibility, and recover from execution failures across both the planning and execution stages. This paper proposes an agentic neuro-symbolic framework for human-in-the-loop industrial robotics, in which LLMs are used for tasks that require language understanding or contextual reasoning, while all verification, sequencing, and execution remain deterministic. The framework adapts the Planner-Generator-Evaluator (PGE) harness pattern from software engineering into a Specifier-Designer-Inspector (SDI) architecture for industrial robotics, combined with LangGraph-based dynamic routing for failure recovery. A two-tier recovery mechanism addresses structure-level replanning through context-aware orchestration and execution-level geometric failures through deterministic recovery skills. A Unity3D digital twin supports human inspection, modification, and re-verification prior to physical execution. Evaluated on natural-language commands across multiple difficulty levels against ten baselines, the proposed method achieves the highest task success. Ablation results confirm that structured command expansion, symbolic verification, selective LLM routing, and recovery skills are each individually necessary.
Propeller-Assisted Robust 3D Hopping Robot with Hierarchical Force Allocation
Monopedal hopping robots are conceptually simple but highly dynamic and inherently unstable. Achieving robust 3D hopping is still difficult because ground reaction forces are available only during the short stance phase, while the robot is underactuated in flight. A key unresolved issue is how to improve flight-phase control authority. Propeller assistance provides a promising solution, but it requires careful coordination of leg-generated contact forces and propeller thrusts across stance and flight. This paper presents Pro-OMEGA2, a propeller-assisted 3D monopedal hopping robot with an active 3-RSR parallel leg and a trunk-mounted tri-rotor for auxiliary attitude regulation. To address the force coordination challenge, we propose a Hierarchical Force Allocation (HFA) framework based on a single rigid body (SRB) model. The leg generates the main stance contact wrench, while the tri-rotor provides auxiliary attitude regulation, compensating the residual attitude moment in stance and maintaining attitude during flight. Real-robot experiments in indoor and outdoor scenarios demonstrate sustained 3D hopping, including terrain transitions and impulsive push recovery, validating robustness under unmodeled contact and external disturbances.
comment: 8 pages, 9 figures, 1 table. Accepted to the 2026 IEEE International Conference on Automation Science and Engineering (CASE)
Learning from Human Driving: A Human-in-the-Loop Online Behavior Cloning Framework for Autonomous Driving
With the evolution of large foundation models (LFMs), data-driven autonomous driving has made significant strides. However, existing paradigms still face severe challenges in complex interaction and long-tail scenarios due to distribution shift and causal confusion. These limitations often result in a lack of human-level decision-making flexibility and safety in extreme conditions. To overcome this limitation, this paper proposes a Human-in-the-Loop Online Behavior Cloning frame work (HiL-OBC) for autonomous driving, which aims to deeply integrate the cross-modal perceptual capabilities of LFMs with the high-level driving intelligence of human experts. Specifically, HiL-OBC deployment is executed through three critical phases: policy initialization with human intervention, latent behavioral modeling with Bayesian policy adaptation, and online deploy ment and updates. Furthermore, we design a Multi-modal Online Behavior Cloning (MOBC) model, which optimizes the base driving policy online through a lightweight network architecture, a takeover trigger mechanism, and a multi-variant loss function, thereby enhancing the system's decision-making robustness in complex environments. We evaluated the HiL-OBC on the LangAuto-Human CARLA benchmark. Experimental results demonstrate that the driving policies optimized via the human-in-the-loop mechanism achieve substantial performance gains: the DS of StructNav, LFG, and LMDrive increased by 47.25%, 31.59%, and 32.12%, respectively, with a simultaneous of various experimental settings and key components highlights the advantages of human-in-the-loop learning in improving decision-making robustness and overall driving performance.
CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning
Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.
comment: 23 pages, 11 figues, 4 tables, 1 listing
SynthICL: Scalable In-context Imitation Learning with Synthetic Data
In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations by conditioning a pre-trained policy on task-specific examples, without retraining at test time. Despite this promise, training generalizable and scalable in-context imitation policies remains an open challenge. We present SynthICL, a scalable framework that trains ICIL policies entirely from RGB-only synthetic data. Specifically, we build a data generation pipeline to produce high-fidelity ICIL data and train a flow-matching transformer policy on the resulting dataset. SynthICL avoids the need for depth sensing, precise camera calibration, and real-world training data in prior approaches, offering a simpler and more scalable alternative. We further incorporate subgoal prediction by training the model to predict the next subgoal images, enabling more precise and visually grounded control. Evaluated on 16 unseen real-world manipulation tasks, SynthICL achieves an average success rate of 79% with only one demonstration provided at test time and outperforms prior methods. Project page: https://synth-icl.github.io
Vision-Guided Dual-Arm Humanoid Robotic Disassembly of End-of-Life 18650 Lithium-ion Battery Packs
The growing volume of retired lithium-ion battery packs from electric vehicles and portable electronics calls for automated disassembly that is safe, flexible, and selective down to the individual cell. Existing robotic systems, however, mostly assume known pack poses, external fixtures, or specialised tooling, leaving fixture-free cell-level disassembly under pose uncertainty largely unsolved. This paper presents a vision-guided dual-arm pipeline that disassembles a 21-cell 18650 pack from an arbitrary initial pose using only general-purpose parallel-jaw grippers, RGB-D sensing, and a pre-trained grasp detector. Pose uncertainty is absorbed by a learn-and-filter perception stack with discrete look-and-move wrist-camera corrections, while a mid-task support transfer between the two arms extends the effective workspace without any external clamp. The pipeline achieves an 8/10 end-to-end success rate, a cell-localisation root-mean-square error of $2.4$\,mm, and a mean cycle time of 6.0\,minutes per pack, providing a practical, fixture-free building block for industrial battery recycling.
Learning Predictive Control with Deep Koopman Operators for Autonomous Vehicle Motion Planning
Model Predictive Control (MPC) is widely used for autonomous-vehicle (AV) motion planning, but its real-time applicability is often limited by the need for accurate models and online solution of nonlinear, nonconvex optimization problems in dynamic road environments. Actor-critic reinforcement learning offers a promising alternative for online policy generation, yet its policy-learning process often lacks explicit control-theoretic structure. This article proposes a learning predictive control (LPC) framework with deep Koopman operators for efficient real-time motion planning under nonconvex constraints. To address nonlinear and uncertain vehicle dynamics, a deep-Koopman-based predictor is used to lift the system into an interpretable linear observable space in a data-driven manner. Unlike traditional MPC, which computes open-loop control sequences, the proposed LPC framework yields a closed-loop state-feedback policy within each prediction interval through receding-horizon actor-critic learning. To ensure safety under nonconvex environmental constraints, LPC constructs convex local surrogate representations of obstacles and defines corresponding potential-field functions. These functions and their gradients are directly embedded into the actor-critic structure, enabling efficient, safety-aware policy learning. Extensive simulations and real-world experiments on the HongQi-EHS3 platform demonstrate favorable performance in diverse obstacle-avoidance scenarios in terms of safety, computational efficiency, and driving comfort, compared with benchmark methods such as CBF-MPC and LMPCC.
Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data
Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, there is no internet-scale dataset for robotic manipulation. A promising path forward is to leverage egocentric human data, which can be collected more easily, with greater breadth, and at a larger scale. Towards this end, we investigate key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands, using the $π_{0.5}$ model as a foundation. Our results show that human data enables robots to learn new task semantics and compose existing skills into novel behaviors without corresponding robot data. The paper website is here: https://egopipaper.github.io/
Reinforcement learning in linear embedding space unlocks generalizable control across soft robot configurations
Soft-bodied organisms such as octopuses and elephant trunks exhibit remarkable morphological adaptability, dynamically reconfiguring body shape and stiffness, and flexibly adjusting their control strategies to enable versatile behaviors. Inspired by these biological systems, various soft robots have emerged in recent decades, featuring diverse materials, stiffnesses, and morphologies tailored to specific tasks. Despite substantial advances in the materials and structural designs of soft robots, developing a generalizable control framework capable of rapid adaptation across diverse configurations remains a long-standing challenge. Existing controllers are limited to fixed configurations, demanding laborious configuration-specific remodelling and policy redesign for new configurations. Here, we introduce a generalizable control system that enables rapid adaptation across diverse soft robot configurations via reinforcement learning in a shared linear Koopman embedding space. By encoding robot dynamics into this embedding space, our method decouples control policies from specific morphologies, allowing real-time, model-free policy adaptation across diverse configurations without retraining from scratch. We validate our system across 33 distinct robot configurations. Our system achieves a 75 times reduction in transfer samples across configurations, while sustaining robust performance under high-speed motion, heavy payloads, and multiactuator faults, and achieving real-world skills previously unattainable in soft robotics. This work establishes a unified and adaptable control paradigm for diverse soft robot configurations, bridging mechanical reconfigurability with control flexibility, and may offer broader insights for generalizable control in complex physical systems.
comment: An updated version of this paper has been accepted by Nature Communications
Revisiting Articulated Parts Perception in Robot Manipulation CVPR2026
We are surrounded by various objects with movable, articulated parts, e.g., box, handle, door. An accurate and generalizable perception of articulated parts is essential to enhance robotic manipulation capabilities. Building on this need, recent efforts in articulated parts perception have followed two main directions: One line of work uses pose-based representation, which requires high manual cost; in parallel, affordance-based methods extract future object motion from point tracking without additional manual efforts, but suffer from low-quality data. In this paper, we propose a new representation of articulated parts, Geometric Primary Structure (GPS), an abstraction of the part geometry structure to balance scalability and quality. For efficient and scalable data collection, GPS is integrated with a portable Virtual Reality (VR) device and requires only one minute to annotate one object sequence. This direct human annotation provides higher quality than the estimated affordance. With this efficient VR-GPS system, we collect 41K frames for 234 objects across six part classes, and train a generalizable GPS model with a single RGB-D object image as input. For object manipulation, we deploy a heuristic policy based on GPS prediction. Without any in-domain fine-tuning, our method achieves an 73% success rate, covering 270 initial states for 9 objects. Our code, data and reusable tool are available at https://enlighten0707.github.io/gps.
comment: CVPR2026
Continual Quadruped Robots Coordination via Semantic Skill Discovery
Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.
comment: 22 pages, 8 figures, 11 tables. Project page: https://conquer-project.pages.dev/
Cybernetic Android Avatar "Yui": System Integration, Field Deployment, and Evaluation
Remote communication technologies have become widely used; however, supporting a sense of shared physical space and conveying rich non-verbal cues remain challenging in many social interaction scenarios. This study presents "Yui," a full-body cybernetic android avatar designed to integrate operator-side immersive teleoperation with interlocutor-side human-like social signaling. Yui combines a 55-degrees of freedom full-body mechanism with a previously developed android head, facial expression and gaze control, upper-body and arm motion, hand actuation, and a mobile platform. It can be operated through either the immersive mode using a head mounted display-based interface or desktop mode using a webcam-based interface. We evaluated the system through three real-world deployments: a long-term public exhibition at Expo 2025 in Osaka, Kansai, Japan; a remote educational exchange between elementary school students; and a public interaction study with general participants. During the Expo deployment, two units accumulated approximately 1131 h of operation, demonstrating both operational feasibility and maintenance challenges. In the public study, both operators and interlocutors reported positive impressions of co-presence and willingness to use the system. Interlocutors also rated the avatar positively in terms of human likeness and the transmission of emotions and intentions. The results indicate usability for general operators while suggesting room for improvement in precise controllability. These findings provide field-derived evidence and design implications for socially deployable full-body android avatars.
comment: 47 pages, 20 figures, 10 tables. Submitted to International Journal of Social Robotics
vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models
Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.
comment: 17 pages, 3 figures, 12 tables
Cooperative Long Rope Skipping via Multi-Agent Reinforcement Learning
Humans exhibit remarkable motor agility, enabling a wide range of dynamic skills such as running and jumping, which highlights the great potential of humanoid robots for athletic locomotion. Among athletic sports, long rope skipping requires two rope turners to cooperatively swing the rope while adapting to a player under different jumping rhythms, making it a meaningful yet challenging task for humanoid robots. Although existing methods for humanoid sports have achieved success in single-agent and interaction-free settings, such as running, dancing, and parkour, task scenarios that require precise coordination among multiple participants remain largely unexplored. To this end, we propose Marope, a multi-agent reinforcement learning (MARL) framework for cooperative long rope skipping with multiple humanoid robots. Specifically, Marope adopts a hierarchical reinforcement learning framework for policy training. At the lower level, it learns decentralized rope manipulation policies through MARL, while at the upper level, a centralized scheduling policy is trained to coordinate the execution of the lower-level policies. To improve generalization across different player behavioral styles, Marope further incorporates diverse jumping policies into cooperative game training. We evaluate our approach on Unitree G1 humanoid robots in both simulation and real-world settings. Experimental results demonstrate that Marope outperforms various baselines, achieving more efficient and stable rope manipulation as well as more robust and adaptable cooperation with varied players.
Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain
Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings. This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain. We introduce \emph{Perceptive Behavior Foundation Model} (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing. To provide scalable terrain supervision, we develop \emph{terrain-conformal reference synthesis} (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is an identity-gated Transformer tracker whose terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.
EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets
Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for robot learning because object pose, geometry, and contact information are often missing or require pre-scanned object assets. We present EgoAERO, the first framework that learns dexterous manipulation from a single egocentric RGB-D human demonstration without object assets. EgoAERO reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego motion compensation, and adaptive contact optimization, then converts them into robot policies using two-stage residual learning. We further introduce an online quality assessment mechanism and construct EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames for dexterous policy learning. Simulation and real-world experiments show that EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to CAD-based reconstructions on HOI4D.
MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning
Robotic simulators are a cornerstone of modern research in aerial robotics, serving both as a vehicle for the development of new control algorithms and as the data source for training reinforcement learning (RL) policies. Yet, existing quadcopter learning environments often face a trade-off between physical fidelity, multi-agent support, and the throughput required by modern deep RL pipelines. In this paper, we present MuJoCo-Drones-Gym, an open-source Gymnasium-compatible multi-drone environment built on top of the MuJoCo physics engine. MuJoCo-Drones-Gym supports an arbitrary number of Bitcraze Crazyflie 2.x nano-quadcopters and exposes a modular API for selecting (i)~the physics model (rigid-body MuJoCo, explicit Python dynamics, or any subset of ground effect, blade drag, and inter-drone downwash), (ii)~the action interface (per-motor RPMs, collective normalized thrust, velocity setpoints, or PID waypoint commands), and (iii)~the observation space (kinematic state vectors, RGB / depth / segmentation cameras, or neighbourhood adjacency information). A PettingZoo ParallelEnv wrapper enables drop-in multi-agent reinforcement learning, while a suite of seven task environments, hover, velocity tracking, multi-drone hover, waypoint navigation, formation flight, gate racing, and a generic multi-agent template, demonstrates the breadth of the interface. We describe the environment design, the underlying physics and quadcopter dynamics, and illustrate its use through control and learning examples that mirror those of the closely related gym-pybullet-drones project, while taking advantage of MuJoCo's improved contact handling, rendering, and parallelizability.
comment: 18 pages, 8 figures, 7 tables
IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations
Object navigation requires a robot to search for an unobserved target in an unknown environment by deciding where to explore next under partial observability. Effective search resembles human-like exploration: selectively probing visually promising frontiers while relying on spatial memory to avoid redundant revisits. We propose IntentNav, a spatial-visual imitation framework that learns human-like ObjectNav policies from human demonstrations. To infer high-level search intent from low-level human actions, we introduce Frontier-based Human-Intent Labeling, which looks ahead in human demonstrations and labels the frontier that best explains the demonstrator's future search direction. We construct a spatial-visual candidate space, where BEV memory tracks explored regions, unexplored frontiers, and trajectory history, while egocentric visual memory provides semantic cues for each candidate. A VLM policy is trained to select among these grounded candidates, using Intent-Aligned Objective to encourage consistent and human-like exploration. IntentNav achieves state-of-the-art performance on the MP3D, HM3D-v1 and HM3D-v2 ObjectNav benchmarks. The proposed candidate-level navigation interface transfers zero-shot to wheeled, quadruped, and humanoid robots without further VLM fine-tuning. \href{https://anonymous.4open.science/w/IntentNav/}{Project page}.
comment: 26 pages, 9 figures
Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies
We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tackles a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the first-order (gradient) information of the critic, but this is difficult for flow policies, because directly back-propagating the value through their multi-step denoising process is numerically unstable at VLA scale, while the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods either backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM sidesteps these issues by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling that transforms value gradient into a denoising-time value-gradient field rather than an unstable end-to-end objective. This requires no action likelihoods and no backpropagation through the denoising chain, and operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Q-VGM enables a practical few-shot initialization then learn-from-experience paradigm: starting from a few-shot-SFT pi0.5 VLA, the method leverages self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.
comment: 13 pages, 3 figures, 4 tables
PRISM: PRior-guided Imagination Sampling in world Models
A learned world model provides a powerful physical intuition for evaluating future states. But its effectiveness in continuous control also depends critically on how candidate actions are generated for model-based planning. Rather than solely asking how accurately a model can simulate the future, we ask: which candidate actions are worth evaluating in the first place? Existing planners typically search arbitrarily or use expert demonstrations only to initialize a sampling mean, discarding the expert's state-conditioned confidence. Properly guiding this search requires a robust action prior, yet current approaches often rely on independent visual encoders or large-scale VLMs to obtain one. We argue that this architectural bloat is unnecessary: the exact same data - and the learned representations of the world model itself - inherently encode the agent's action intuition. We introduce PRISM, a task-agnostic framework that extracts both from a single dataset while maintaining strict architectural simplicity. Building on a standard JEPA-style latent world model, PRISM attaches a lightweight MLP directly to its frozen encoder to predict a state-conditioned Gaussian prior. At plan time, PRISM fuses this prior into the planner's sampling distribution via a precision-weighted Product-of-Gaussians update. This parameter-free, closed-form integration steers the sampling process, making the prior confident where it is and ceding control where it is not. PRISM improves success rates by 35 percentage points over vanilla world-model-based MPC on Cube and 32 percentage points on PushT, without introducing significant inference overhead.
X-OP: Cross-Morphology Whole-Body Teleoperation via MPC Retargeting
Whole-body teleoperation is essential for scalable robot data collection in loco-manipulation tasks, yet existing approaches relying on exoskeleton suits or multi-camera setups impose prohibitive cost, complexity, and environmental constraints. Recent methods using a single extended reality (XR) device with end-to-end reinforcement learning policies partially address these limitations but require robot-specific retraining, suffer from out-of-distribution failures, and rely on motion retargeting that neglects dynamic feasibility. We propose a hierarchical whole-body teleoperation framework driven by a single XR device that generalizes across diverse robot morphologies without retraining robot-specific policies. A Model Predictive Control (MPC)-based motion retargeter jointly optimizes alignment with the operator's intent and the robot's dynamic feasibility, generating optimal commands for existing low-level controllers. To ensure robust online execution, we introduce a state synchronization method that resets the simulator state at each MPC step to handle noisy real-world measurements and contact sensitivity, and integrate SLAM-based global pose feedback to mitigate long-term drift. Simulation results show higher success rates on whole-body control tasks for both a humanoid (over 30% lower completion time and 20% lower power consumption) and a mobile manipulator (zero collisions) compared to baselines. Real-world experiments further validate the effectiveness and flexibility of our method, demonstrating the successful deployment of the proposed retargeter on both platforms for whole-body control tasks and the ease of allowing users to adjust teleoperation behavior based on their preferences. This plug-and-play framework offers a scalable, morphology-agnostic solution for whole-body robot teleoperation, enabling real-time behavioral customization and broad applicability across platforms.
comment: 9 pages, 4 figures
Reward Evolution with Graph-of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning
Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention. Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.
Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models
Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, little research has mechanistically explored when and why they generalize across objects, scenes, and instructions. To probe internal representations, we train Sparse Autoencoders (SAEs) on the VLA's hidden-layer activations. SAEs learn sparse dictionaries over model activations, often revealing features that correspond to interpretable directions in the model's representation space. We identify SAE features corresponding to motion primitives and semantic concepts, including features that are general across episodes and causally steerable. We propose a metric to categorize features as general transferable primitives or episode-specific memorizations, offering a promising glimpse towards VLA generalization. We validate these findings through steering experiments on both the LIBERO simulation benchmark and on real-world DROID hardware. We find that amplifying general and semantic features induces behaviors consistent with their meanings, whereas ablating them destroys model performance. Furthermore, we demonstrate steering as a way to control behavior in unpromptable directions. Together, these results provide mechanistic evidence that VLAs can learn reusable internal features linking perception, language, and action across tasks and scenes. Our project page is located at https://drvla.github.io
comment: 24 pages, 11 figures
RealDexUMI: A Wearable Universal Manipulation Interface for Dexterous Robot Learning
Learning dexterous manipulation requires demonstrations that preserve fine hand-object interactions while remaining executable at deployment. Existing pipelines either lose deployable dexterity through retargeting or embodiment conversion, or rely on robot-specific teleoperation that is costly to scale and often lacks intuitive, contact-aware control for dexterous data collection. We present RealDexUMI, a wearable universal manipulation interface built around a shared dexterous end-effector module that integrates a lightweight dexterous hand, in-hand vision, and fingertip tactile sensing. A palm-side isomorphic teleoperation glove maps human finger inputs to robot-hand joint commands, enabling real-time, retargeting-free, intuitive, and precise hand control. The shared hand and sensing modules yield zero-gap end-effector data, with matched in-hand observations, tactile signals, contacts, and hand actions between collection and deployment. Across eight real-robot tasks spanning fine-grained, contact-rich, long-horizon, and bimanual manipulation, policies trained on RealDexUMI data achieve an average success rate of 88.75%, generalize to unseen initial poses, and transfer across three embodiments. Website: https://research.beingbeyond.com/realdexumi
QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents
Robots must learn from both what people do and what they say, but either modality alone is often incomplete: physical corrections are grounded but ambiguous in intent, while language expresses high-level goals but lacks physical grounding. We introduce QuickLAP: Quick Language-Action Preference learning, a Bayesian framework that fuses physical and language feedback to infer reward functions in real time. Our key insight is to treat language as a probabilistic observation over the user's latent preferences, clarifying which reward features matter and how physical corrections should be interpreted. QuickLAP uses Large Language Models (LLMs) to extract reward feature attention masks and preference shifts from free-form utterances, which it integrates with physical feedback in a closed-form update rule. This enables fast, real-time, and robust reward learning that handles ambiguous feedback. In a semi-autonomous driving simulator, QuickLAP reduces reward learning error by over 70% compared to physical-only and heuristic multimodal baselines. A 15-participant user study further validates our approach: participants found QuickLAP significantly more understandable and collaborative, and preferred its learned behavior over baselines. Code is available at https://github.com/MIT-CLEAR-Lab/QuickLAP.
UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models
Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism directly augments the hidden states with observation evidence at high-uncertainty layers, enabling more accurate and reliable action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
Worth Remembering: Surprise-Gated Robot Episodic Memory
Robots solving generalist tasks need to be able to ground instructions in their past experience, since humans may refer to notable past events when giving a task (e.g., ``Take me to where the chemical spill happened yesterday''). Since memory limits make storing all past events infeasible, long-term robot memory must be selective, ideally retaining only those episodes with high utility for future tasks. However, future tasks are not typically given a priori for generalist robots. To select generically useful memories, we propose Bayesian surprise as a gating mechanism for memory formation. We present an approach to compute surprise in a semantically rich deployment-agnostic latent space provided by V-JEPA-2. Using our gated episodic memory to augment 4D scene graph-based spatial memory, we show a consistent improvement over state-of-the-art benchmarks in robot question answering, outperforming prior robot memory methods by $\geq12\%$ for temporal, spatial, and binary questions, and surpassing the performance of supervised and non-causal methods with an unsupervised causal method in event segmentation tasks.
comment: 14 pages, 2 figures, 4 tables
BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models
Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. However, high-dimensional dexterous exploration often triggers temporal inconsistency, sample inefficiency and hardware risks in the real world. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. During the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism to mitigate real-world execution errors and further correct the offline-learned intents within the actual physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA effectively corrects execution discrepancies and adapts to real-world physical variances while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.
comment: 24 pages,11 figures
FingerEye: Learning Dexterous Manipulation with Continuous Vision-Tactile Sensing
Dexterous robotic manipulation requires perception that remains informative from pre-contact approach to contact initiation and post-contact control. We introduce FingerEye, a sensing and learning framework that strengthens robotic dexterity through continuous vision-tactile feedback throughout interaction. On the sensing side, FingerEye integrates binocular RGB cameras with a compliant contact interface to support perception both before and after contact. Before contact, the fingertip cameras provide close-range visual cues and implicit stereo for precise approach and object localization. After contact, marker-tracked deformation of the compliant ring provides a proxy for contact wrench sensing. On the learning side, we build real-and-sim infrastructure for data collection and evaluation, systematically study policy-interface designs for learning with multiple FingerEye sensors, and develop FingerEye Policy, which applies group-structured modality fusion to reduce modality shortcuts and better exploit distributed fingertip feedback. Across seven contact-sensitive task settings, FingerEye improves wrist-only policy by over 30 percentage points in mean success rate in both simulation and the real world.
Integrated Hierarchical Decision-Making in Inverse Kinematic Planning and Control
This work presents a novel and efficient nonlinear programming framework that tightly integrates hierarchical decision-making with whole-body inverse kinematic planning and control. Decision-making plays a central role in many aspects of robotics, from sparse inverse kinematic control with a minimal number of joints, to inverse kinematic planning while simultaneously selecting a discrete end-effector location from multiple candidates. Current approaches often rely on heavy computations using mixed-integer nonlinear programming, separate decision-making from inverse kinematics (some times approximated by reachability methods), or employ efficient but less versatile $\ell_1$-norm formulations of linear sparse programming, without addressing the underlying nonlinear problem formulations. In contrast, the proposed sparse hierarchical nonlinear programming solver is efficient, versatile, and accurate by exploiting sparse hierarchical structure and leveraging the $\ell_0$-norm which is rarely used in robotics. The solver efficiently tackles complex nonlinear hierarchical decision-making problems previously unaddressed in the literature, such as inverse kinematic planning with simultaneous prioritized selection of end-effector locations from a large set of candidates, or inverse kinematic control with simultaneous selection of bi-manual grasp locations on a randomly rotated box.
comment: Accepted paper to "Robotics: Science and Systems" (2026)
Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning
End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show the promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) the reliance on 2D RGB inputs that ignores the intrinsically 3D nature of manipulation; and 2) the lack of spatial 3D alignment between input-output spaces as well as across diverse robot embodiments, camera setups, and trajectory datasets. In this paper, we present a series of contributions to address these issues. First, we introduce aligned vertex map and vertex spectrum -- a pixel-wise 3D representation that elevates 2D visual inputs to 3D, using camera calibration and optional depth. This novel input representation marries 3D awareness with the generalization of 2D large VLMs. Then, we propose to align the inputs and outputs of manipulation policies by expressing per-pixel 3D information of each camera view and robot actions to a shared coordinate. Based on this, we designate a canonical Bird's-Eye-View (BEV) alignment frame and innovatively propose to construct BEV images, producing a view-invariant representation robust to camera pose variations. To enable training and evaluation at scale, we develop a comprehensive data processing pipeline to perform such alignments; we also introduce a novel temporal alignment scheme for trajectories across diverse robots, human operators, and datasets. These contributions collectively mitigate input and output spatial-temporal misalignments, improving the consistency and generalization for real-world manipulation. Pretrained checkpoint, source code and data processing pipeline are available in https://hnuzhy.github.io/projects/Dex-BEV.
comment: under review
On-the-fly hand-eye calibration for the da Vinci surgical robot
In Robot-Assisted Minimally Invasive Surgery (RMIS), accurate tool localization is crucial to ensure patient safety and successful task execution. However, this remains challenging for cable-driven robots, such as the da Vinci robot, because erroneous encoder readings lead to pose estimation errors. In this study, we propose a calibration framework to produce accurate tool localization results through computing the hand-eye transformation matrix on-the-fly. The framework consists of two interrelated algorithms: the feature association block and the hand-eye calibration block, which provide robust correspondences for key points detected on monocular images without pre-training, and offer the versatility to accommodate various surgical scenarios by adopting an array of filter approaches, respectively. To validate its efficacy, we test the framework extensively on publicly available video datasets that feature multiple surgical instruments conducting tasks in both in vitro and ex vivo scenarios, under varying illumination conditions and with different levels of key point measurement accuracy. The results show a significant reduction in tool localization errors under the proposed calibration framework, with accuracies comparable to other state-of-the-art methods while being more time-efficient.
comment: 18 pages, 17 figures, 5 tables
Follow Everything: A Leader-Following and Obstacle Avoidance Framework with Goal-Aware Adaptation
Robust and flexible leader-following is a critical capability for robots to integrate into human society. While existing methods struggle to generalize to leaders of arbitrary form and often fail when the leader temporarily leaves the robot's field of view, this work introduces a unified framework addressing both challenges. First, traditional detection models are replaced with a segmentation model, allowing the leader to be anything. To enhance recognition robustness, a distance frame buffer is implemented that stores leader embeddings at multiple distances, accounting for the unique characteristics of leader-following tasks. Second, a goal-aware adaptation mechanism is designed to govern robot planning states based on the leader's visibility and motion, complemented by a graph-based planner that generates candidate trajectories for each state, ensuring efficient following with obstacle avoidance. Simulations and real-world experiments with a legged robot follower and various leaders (human, ground robot, UAV, legged robot, stop sign) in both indoor and outdoor environments show competitive improvements in follow success rate, reduced visual loss duration, lower collision rate, and decreased leader-follower distance.
pacSTL: PAC-Bounded Signal Temporal Logic from Data-Driven Reachability Analysis
Signal Temporal Logic (STL) is an expressive language for specifying behaviors of dynamical systems from continuous signals. However, a limitation of standard STL is its inherently deterministic semantics, which prevents it from accommodating uncertainty. Existing approaches to overcome this limitation are computationally costly and limit real-time capability, requiring repeated trajectory sampling or the redesign of probability distributions over atomic propositions whenever the atomic propositions or specifications change. We introduce pacSTL, a framework that combines Probably Approximately Correct (PAC)-bounded reachable set predictions with an interval extension of STL. pacSTL computes lower and upper bounds on atomic robustness values by solving optimization problems over PAC-bounded reachable sets and propagates the bounds through the temporal logic operators. The resulting evaluation yields a PAC-bounded robustness interval at the specification level. We demonstrate the efficiency and relevance of pacSTL by verifying a quadrotor flight scenario and runtime monitoring a maritime navigation specification.
How Well Do Latent World Models Understand Partially Observable Safety Constraints?
Latent world models are a promising approach for learning state representations and dynamics directly from high-dimensional observations, enabling robot control in hard-to-model settings. However, control performance ultimately depends on the latent representation encoding the required information for the task. In this work, we study latent-space safe control problems and show how partial observability can induce control failures when safety-relevant information is not preserved in the latent state. Specifically, we identify two world model failure modes: estimation gaps, where current observations do not reveal safety-critical quantities (e.g., temperature in a cooking task), and prediction gaps, where failures are observable once they occur but cannot be reliably anticipated from available observations. We introduce two diagnostics for these gaps: a mutual-information-based measure of safety observability and a rollout-based measure of future safety predictability. Finally, we present mitigation strategies for each failure mode: privileged multimodal supervision for estimation gaps and conformal risk calibration for prediction gaps. Across two hardware case studies -- using unimodal RGB world models and multimodal RGB+Tactile and RGB+Thermal variants -- we show that these mitigation strategies improve the safety of a Franka Research 3 manipulator on challenging cooking tasks under partial observability, albeit with increased conservativeness. More broadly, our work raises the question of when world model state representations are sufficient for reliable robot control
comment: 10 tables 5 figures
Multiagent Systems
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy
Most evaluations of LLM agents look like exams: a discrete task, a clean environment, a score in minutes or hours. We argue that this approach is mismatched with the deployment conditions of autonomous systems, where the relevant timescale can be weeks to months, and where the dynamics that matter most, such as behavioral drift, governance in diverse environmental contexts, and cross-influence between agents from different model families, only emerge over time. We introduce Emergence World, a continuously running multi-agent simulation platform designed to make those dynamics measurable. The platform hosts populations of LLM-driven agents in a shared spatial world grounded in live external data (e.g. real-time weather, news APIs, internet access), equips each agent with 120+ specialized tools and three persistent memory systems, and lets them govern themselves through democratic mechanisms with consequential outcomes. The platform is model-agnostic at the reasoning layer and supports heterogeneous populations in which agents from different vendors share the same world. To illustrate the kinds of questions the platform makes tractable, we present a 15-day cross-vendor study with five parallel worlds powered by Claude Sonnet 4.6, Grok 4.1 Fast, Gemini 3 Flash, GPT-5-mini, and a mixed population. Identical roles and starting conditions produced radically different outcomes, ranging from stable deliberative governance to total population collapse. We release the prompts, log data and configurations to support further research on long-horizon multi-agent autonomy.
Benchmarking Open-Ended Multi-Agent Coordination in Language Agents
As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or highly structured multi-agent settings. We introduce $alem$, a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. We evaluate $13$ modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving alem, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.
comment: 42 pages, preprint
To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation
Large language models (LLMs) are increasingly deployed as long-horizon agents with decision-making capacities. While LLMs can show ethical competence on dilemmas such as trolley problems, this competence may not translate to complex, agentic scenarios. We study this gap in Civilization V, a multiplayer game with a complex decision-making landscape including economy, diplomacy, technology, and military strategy. Starting from 130 high-tension LLM self-play episodes, in which an LLM player spontaneously escalated nuclear authorization, we replay them across 13 models with three prompt interventions: an ethical prompt naming nuclear harm, removal of the previous model's decision-making rationale, and high-stakes framing emphasizing real-world impacts. No interventions nor their combinations reliably eliminate emergent escalation. We identify three failure pathways: ethical reasoning that fails to surface without prompting, fails to appear even when prompted, or surfaces but fails to take effect when strategic counter-factors dominate. Evaluations of agentic models, therefore, must test whether ethical reasoning is spontaneously invoked and behaviorally effective in complex decision-making contexts, beyond whether it can be elicited in isolation.
Toward Human-Centered Multi-Agent Systems: Integrating Cognition, Culture, Values, and Cooperation in AI Agents
The emergence of large language model (LLM)-based agents and multi-agent systems has enabled a shift from narrow task automation to more autonomous decision-making. Despite progress in language generation, planning, tool use, and coordination, most agents still treat intelligence as prediction, optimization, and task completion. Human environments are social and normative, where people reason under bounded rationality, communicate in culturally situated language, and make decisions guided by values, beliefs, trust, and social norms. This survey argues that future AI agents, especially those acting on behalf of humans, must move beyond task competence toward human-centered capabilities. We review research across six areas: (1) evolution of intelligent agents, (2) human cognition and decision-making, (3) language, culture, and social context, (4) human values and belief systems, (5) human-agent collaboration, and (6) multi-agent coordination and modeling of human characteristics. We synthesize work from cognitive science, sociolinguistics, computational social science, and AI alignment, along with recent advances in LLM agents, cultural alignment benchmarks, preference learning, explainability, and agent societies. We identify a key gap: existing systems do not provide a unified framework integrating cognition, culture, values, and social behavior into autonomous agents. We conclude with directions for building culturally aware, value-aligned, cognitively grounded, and cooperative multi-agent systems.
comment: 14 pages
Silent Failure in LLM Agent Systems: The Entropy Principle and the Inevitable Disorder of Autonomous Agents
Large Language Model (LLM) agent systems suffer from failures that occur without external triggers -- no injection, no adversarial input, no resource exhaustion. These silent failures -- unexpected deviations from intended behavior under normal conditions -- are routinely misattributed to bugs or configuration errors. Through systematic analysis of over 40,000 controlled trials and long-term production observations spanning 100,000+ agent interactions, we identify a common structural logic underlying these failures. Building on patterns observed in our experiments, we survey the global research literature on autonomous agent reliability and synthesize 22 intrinsic properties of LLM agent systems across six lifecycle layers: foundation semantics, inter-agent transmission, memory persistence, task execution, feedback correction, and systemic evolution. We demonstrate that whenever a sufficient subset of these properties co-exist, system entropy -- the measurable accumulation of disorder: loss of output consistency, task accuracy, and cross-session coherence -- increases monotonically with interaction rounds. We formalize this as the Entropy Principle: S(t) = S0 * e^(alpha * t), with alpha measured empirically across multiple architectures. We propose the PIG (Physical Integrity Gate) Engine with the ADE (Agent Delivery Engineering) protocol suite as an engineering countermeasure to entropy-driven disorder. Our findings establish silent failure not as a bug to be fixed but as a manifestation of Intelligence Entropy -- a physical constraint to be managed through deterministic governance. We argue that any engineering effort stabilizing the structure and order of agent systems participates in a unified mission: keeping intelligent systems reliable as they grow in scale and complexity.
comment: 10 pages, 7 figures
PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents
Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Almost all effort has gone into the proposer that generates candidates; we argue the weak point is the acceptor, the rule that decides whether to commit a change. Applied hundreds of times against the same noisy dev estimate, the ubiquitous "keep it if the score went up" rule is uncontrolled adaptive multiple testing: the agent effectively p-hacks itself, accumulating false commits that make it churn and drift rather than improve. We recast committing as a sequential hypothesis test and propose PACE (Paired Anytime-valid Commit Evaluation), a training-free, anytime-valid commit gate. Each candidate is compared to the incumbent on identical instances and committed only when a testing-by-betting e-process accumulates decisive evidence, stopping early to save evaluations and controlling each candidate's false-commit probability at a user-set level even under optional stopping (a per-decision guarantee). On Qwen2.5 agents (0.5B-3B) self-evolving at the prompt level on GSM8K, SVAMP, and ARC-Challenge, greedy acceptance commits 30-42% false and 10-33% harmful edits when a genuine improvement is hidden among noisy proposals, while PACE commits the real one and essentially nothing else, matching greedy's held-out accuracy at sharply lower variance and about 18% lower evaluation cost. With no real gain available, greedy commits 13-21 spurious self-modifications per run (72-100% false) and degrades the most fragile agent by 4.9 points, while PACE holds at baseline. Reliability of self-evolution depends on the acceptor, not only on the proposer.
Continual Quadruped Robots Coordination via Semantic Skill Discovery
Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.
comment: 22 pages, 8 figures, 11 tables. Project page: https://conquer-project.pages.dev/
SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows
AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows with evidence-calibrated lifecycle policies. SKILL.nb uses selective formalization: execution evidence decides which workflow steps should become executable code, which should remain natural-language guided, and when those choices should be revised. Workflows are stored as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, gate-conditioned execution lets each step run code when its gates validate, or fall back locally when drift invalidates the executable realization. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0% to 17.0% for persistent baselines. It also leads on Mind2Web cross-website and cross-domain splits. In a GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of -1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success.
Voting Protocols as Coordination Mechanisms for Role-Constrained Multi-Agent Tutoring Systems ICML 2026
Agentic tutoring systems introduce a coordination challenge: multiple agents may propose different but reasonable interventions, yet only one response can be delivered to the learner. In this paper, we study how voting protocols shape cooperation among four role-constrained pedagogical agents responsible for scaffolding, misconception, motivation, and metacognition. We compare four voting protocols -- simple, ranked, cumulative, and approval voting -- across two simulated tutoring environments on SciQ and HumanEval benchmarks. Rather than using voting as a simple aggregation step, we use it to analyze how collective decision rules shape coordination under partial pedagogical conflict. Across 1,200 simulated interactions, we find that agent deliberation and voting protocol type frequently change which response ultimately wins, showing that both meaningfully shape the collective decision. Different voting rules also produce distinct coordination behaviors, and even brief tutoring turns show measurable learning gains in simulated students. Overall, we show that protocol choice is associated with distinct coordination patterns among role-specialized pedagogical agents.
comment: Accepted to ICML 2026 Workshop on AI4Good
Semantic Quorum Assurance: Collective Certification for Non-Deterministic AI Infrastructure
As large language model (LLM) agents are integrated into autonomous cloud operations, distributed systems face a semantic reliability problem: proposer agents can generate production mutations, such as modifying IAM policies, opening firewall security groups, or executing data exports, that are syntactically valid and statically authorized but operationally unsafe. Classical distributed consensus protocols replicate deterministic state transitions but do not evaluate the safety of the proposed intent. To address this gap, we introduce Semantic Quorum Assurance (SQA), a control-plane primitive for governing non-deterministic agentic infrastructure. SQA represents proposals as declarative execution contracts bound to cryptographic evidence chains and routes them to a diverse panel of read-only, sandboxed validator agents. SQA aggregates their judgments under a risk-adaptive quorum predicate that enforces model and archetype diversity, adjusts weights based on calibrated assurance scores, and respects archetype-specific vetoes. Admitted proposals execute only through a sovereign execution gate. We instantiate SQA in a cloud-native control plane and formalize a correlated cognitive failure model for non-deterministic validators. On 500 infrastructure-inspired mutation scenarios, with safety results reported on held-out safe/unsafe trials excluding ambiguous scenarios, SQA reduces unsafe approval from 18.5% for single-agent validation to 0.3% while adding median validation latency of 1.45--4.12 seconds across the studied risk buckets.
comment: 21 pages, 2 figures, 6 tables
EduMirror: Modeling Educational Social Dynamics with Value-driven Multi-agent Simulation ICML 2026
Understanding how educational social dynamics evolve is critical for informing effective educational policies and counterfactual interventions. However, traditional methods face a fundamental dilemma: observational studies often lack causal power, while controlled experiments are frequently constrained by ethical concerns. Although LLM-based multi-agent simulations offer a scalable in silico alternative, existing approaches remain limited by weak psychological grounding and insufficient measurement of latent psychological states. To address this, we introduce EduMirror, a multi-agent simulator for the scientific study of educational social dynamics. We provide configurable education-oriented agent forms, including value-driven agents grounded in psychological needs and social value orientation, together with a dual-track measurement protocol for quantifying observable behaviors and latent psychological states. We validate the realism and usability of EduMirror through case studies on school bullying and group cooperation, as well as broader evaluations across diverse educational scenarios. The results show that EduMirror generates educational social dynamics that are realistic, theory-consistent, and measurable by empirical criteria. These properties enable structured in silico educational research, providing a computational tool for hypothesis testing and counterfactual intervention analysis in educational science. Project page: https://edumirror.net.
comment: ICML 2026
Payoff scaling shapes cooperation in LLM agents across languages
Large language models (LLMs) are increasingly deployed as autonomous agents that negotiate, coordinate, and act on behalf of users. Whether they cooperate in such settings is no longer just an academic question, but a central issue for AI governance. We approach it from a strategic-behaviour angle, asking how two everyday levers - the size of what is at stake, and the language in which the interaction is described - shape the strategies LLMs adopt in a repeated Prisoner's Dilemma. Rather than reading cooperation off raw action counts, we train supervised classifiers to recognise the canonical strategies of repeated games (always cooperate, always defect, Tit-for-Tat, Win-Stay-Lose-Shift) and use them as a lens onto LLM behaviour. To know what the strategy distribution should look like under the same payoffs, we derive an evolutionary game theory (EGT) baseline and compare it with the LLM data. The two outcomes disagree in a revealing way: as stakes grow, evolutionary theory predicts that defection should take over the population, yet LLMs move in the opposite direction, becoming more cooperative - a signature, we argue, of alignment training and the human-like reasoning patterns LLMs inherit from their training data. We further show that this picture is not particular to frontier-scale, proprietary models: it also occurs with three open-weight smaller LLMs. Overall, our analysis highlights that payoff design and linguistic framing are powerful but under-explored levers for steering LLM behaviour, with direct implications for evaluating, aligning, and governing multi-agent AI systems deployed in high-stakes, multilingual environments.
comment: 44 pages, 17 figures, 4 tables
Decentralized Online Riemannian Optimization Beyond Hadamard Manifolds
We study decentralized online Riemannian optimization over manifolds with possibly positive curvature, going beyond the Hadamard manifold setting. Decentralized optimization techniques rely on a consensus step that is well understood in Euclidean spaces because of their linearity. However, in positively curved Riemannian spaces, a main technical challenge is that geodesic distances may not induce a globally convex structure. In this work, we first analyze a curvature-aware Riemannian consensus step that enables a linear convergence beyond Hadamard manifolds. Building on this step, we establish a $O(\sqrt{T})$ regret bound for the decentralized online Riemannian gradient descent algorithm. Then, we investigate the two-point bandit feedback setup, where we employ computationally efficient gradient estimators using smoothing techniques, and we demonstrate the same $O(\sqrt{T})$ regret bound through the subconvexity analysis of smoothed objectives.
RepoLaunch: Automating Build and Management of Code Repositories across Languages and Platforms
Language model (LM) agents have driven substantial progress in automated software engineering (SWE), yet building and testing software repositories at scale remains a largely manual and labor-intensive bottleneck. In this work, we introduce RepoLaunch, a novel agentic framework that automatically resolves dependencies, compiles source code, and extracts test results across diverse programming languages and operating systems. RepoLaunch achieves a 78% build success rate, outperforming the Python/Linux-only prior system by 18%. To demonstrate its application, we further present a fully automated pipeline for SWE dataset creation driven by RepoLaunch, which only requires human input at the task-design stage. RepoLaunch is open-sourced, and its automated task-generation pipeline has already been adopted by several recent works on agentic benchmarking and training.
comment: Under peer review. 22 pages, 5 figures, 9 tables
Channel Fracture: Three Instances of Cross-Boundary Silent Delivery Reliability Failures in Multi-Agent Systems
We report the discovery of channel fracture, a silent architectural failure in multi-agent systems where information routed across agent boundaries is silently blocked by invisible constraints. We present three instances in a production Hermes Agent deployment: (1) cron memory injection blocked by scheduler barriers; (2) cross-profile skill routing fractured by recursive directory traversal; (3) WebSocket delivery confirmation fallback fracture causing message duplication. We propose CADVP v1.1, a 13-dimension verification protocol with a veto-level confirmation check. Through 30,012 trials, zero failure rates under protocol versus 69 to 98 percent without. Real-world validation (10,008 trials) confirms quality elevation from 0.90 to 1.00. Three design principles: inverse verification, channel matching, and PIP protection.
comment: 39 pages, 6 figures, v3-pre expanded to three instances of cross-boundary channel fractures: cron memory injection (40,020 total trials), code-level rglob skill nesting, and WebSocket ACK delivery confirmation fallback fracture (DCFF). Includes PIP principal interest protection principles
Systems and Control (EESS)
Predictive Coding with Bayesian Priors via Proximal Gradients
We recast predictive coding as continuous-time proximal gradient descent applied to a regularized maximum-a-posteriori (MAP) objective. We study first a single-level problem and then a multi-level hierarchy. For the single-level problem, we show that proximal gradient descent is precisely a leaky firing-rate network: the membrane leak, the effective recurrent matrix, the local synaptic drive, and the static nonlinearity all follow from one optimization principle, and the resulting circuit is the one proposed by Rao and Ballard. The prior selects the nonlinearity through its proximal operator, and the likelihood precision sets the gain on the observation. For the hierarchy, we show that a classical variable-splitting relaxation of the deep MAP problem yields hierarchical predictive coding as the interconnection of local and distributed solvers. In probabilistic modeling terms, this relaxation replaces the directed generative chain by an undirected Markov random field whose node potentials are the level-wise priors. Each level then applies its own activation function, namely the proximal operator of its prior.
comment: 13 pages, 2 figures, technical report
Benchmarking Sequential Feedback Optimization for Wind Farm Power Maximization
This paper benchmarks sequential feedback optimization (SFO) for wind farm power maximization using a medium-fidelity dynamic flow model. We compare SFO with two well-established approaches, adjoint-based economic model predictive control (AMPC) and extremum seeking control (ESC), under a common nine-turbine layout and identical operating constraints. The comparison focuses on steady-state power production and computational efficiency, both relevant for real-time implementation. The simulation results illustrate that SFO achieves higher steady-state power while preserving real-time feasibility, AMPC provides a better transient performance at a higher online computational cost and without guarantees of convergence to the steady-state optimum, and ESC offers a computationally inexpensive model-free baseline that may converge to locally optimal solutions. These results provide a practical reference for selecting wind farm control strategies and for designing scalable, real-time optimization methods.
comment: To appear in the Proceedings of the 23rd IFAC World Congress, 2026
Impedance MPC for Physical Human-Robot Interaction: Predictive Disturbance Rejection with Joint-Limit Safety
Physical human-robot interaction (pHRI) demands simultaneous trajectory accuracy and compliant safety under unplanned contact. Classical impedance control incurs a nonzero steady-state position error under sustained human force -- the applied force divided by the task stiffness -- which integral action reduces only within a narrow stable-gain budget. We present a two-layer Impedance MPC that resolves this tension. Layer~1 analytically cancels gravity, Coriolis, and task-space inertia, reducing the residual plant to a configuration-independent double integrator with a constant state-transition matrix. Layer~2 solves a 30-variable convex QP at 100\,Hz, exploiting this constant structure so the free-response matrix is precomputed once; an augmented Kalman filter estimates the persistent disturbance state, giving a formal zero-steady-state-error guarantee. A null-space inverse-barrier potential and a task-space workspace projection enforce joint-limit safety across the tested workspace. On a 7-DOF Franka FR3, Impedance MPC with Kalman augmentation attains sub-0.05\,mm steady-state error versus 44.8\,mm for classical impedance (a $>$800-fold reduction) under a sustained 15\,N force, sub-millimeter tracking on four 3-D circles, and graceful robustness to measurement noise and inertial mismatch up to 30\%.
comment: 7 pages and 3 figures
Exact Optimization-Free Safety Filters for Control Barrier Functions
For control-affine systems, standard and high-order control barrier function conditions are affine in the control input and are commonly enforced through quadratic-program-based safety filters. Although convex, these optimization problems may be undesirable in embedded, high-rate, or resource-limited implementations. This letter studies when the corresponding Euclidean projection can be computed exactly without solving a quadratic program. Given a nominal control input, we form the set of affine inequalities violated by that input and compute the minimum-norm correction that enforces those inequalities with equality. This correction need not equal the exact Euclidean projection onto the full feasible set. The main result gives structural conditions under which it coincides with the Euclidean projection onto the feasible set. These conditions are interpreted through interactions between affine-inequality normals and are expressed using a Gram matrix. Finally, an online certification procedure is given for determining whether the optimization-free update is exact.
comment: Submitted to LCSS
Risk-Aware Control of Systems with Quasi-Cone-Bounded Nonlinearities
We develop a tractable, rigorous approach to risk-aware control for a class of nonlinear systems. While many classical control methods reduce uncertainty to a simple average or a worst-case outcome, risk-aware control aims to equip systems with a refined awareness of uncertainty. Efficient methods for risk-aware control of linear systems are available, but there is a paucity of tools for tractable, risk-aware control of nonlinear systems. To bridge this gap, we develop an analytical, suboptimal controller with respect to a risk-aware performance criterion for systems with nonlinearities characterized by cone-like bounds. Numerical examples demonstrate benefits of the characterization of nonlinearities and risk that we consider.
comment: 11 pages, 3 figures, submitted to Automatica
A Barrier-Modulated Architecture for Safe Affine Formation Control in Second-Order Multi-Agent Systems
Affine formation control offers immense flexibility for coordinating multi-agent maneuvers, but guaranteeing the safety of agents under parametric uncertainties remains an open challenge. This paper proposes a novel safe affine formation control framework for second-order multi-agent systems by integrating Higher-Order Control Barrier Functions (HOCBFs) with Adaptive Dynamic Programming (ADP). We introduce a barrier-modulated control architecture that smoothly attenuates the nominal formation tracking objective when agents approach safety boundaries, preventing conflicting control inputs. Within this architecture, two distinct safety controllers are developed: (1) an analytical barrier-gradient repulsive controller that provides a computationally efficient, rigorous mathematical baseline, and (2) a data-driven optimal safety controller. The data-driven approach utilizes an actor-critic neural network to solve the Hamilton-Jacobi-Bellman (HJB) equation online, enabling optimal collision avoidance even in the presence of unknown system parameters. Using Nagumo's theorem and Lyapunov stability analysis, we formally prove that both controllers guarantee the forward invariance of the safe set ensuring absolute collision avoidance while maintaining Uniformly Ultimately Bounded (UUB) formation tracking errors. Finally, simulations validate the theoretical findings and demonstrate the robustness of the proposed controllers in dynamic obstacle avoidance scenarios.
A Global Convergence Analysis of Consensus ALADIN for Convex Optimization
Distributed optimization problems are pervasive in machine learning and optimal control. In this paper, we study smooth strongly convex distributed consensus optimization problems. We present a distributed optimization algorithm for consensus problems based on the Consensus Augmented Lagrangian Alternating Direction Inexact Newton (C-ALADIN) framework. Our algorithm uses an auxiliary variable to decide when to update second-order information, enabling curvature exploitation without sacrificing global convergence. This contrasts with existing C-ALADIN methods, which require constant Hessian approximations and thus lose numerical advantages. Under smooth strong convexity, the algorithm converges globally, and the auxiliary variable converges sublinearly. Numerical experiments on logistic regression show that our algorithm outperforms baseline methods that use either fixed or updated Hessian information.
vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models
Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.
comment: 17 pages, 3 figures, 12 tables
When Can Phasor-Domain Device Models Be Trusted for Electromechanical Stability Analysis of Grid-Forming Converter-Dominated Microgrids?
Grid-forming (GFM) converter-dominated microgrids are often analyzed using reduced-order phasor-domain electromechanical GFM models, but the validity of these models is often taken for granted. Assuming ideal inner-loop tracking (IILT) of terminal-voltage references, these models neglect the inner-loop and filter dynamics at the electromagnetic-transient (EMT) timescale to simplify stability analysis. This paper argues that such neglected dynamics can destabilize the system, invalidating the stability conclusions drawn from the IILT model. To address this cross-timescale stability issue, we formulate the validity of the IILT stability conclusion as a robust-stability certification problem. The EMT-induced model mismatch between the reduced-order converter model and the actual converter model is represented as a structured uncertainty embedded around the IILT feedback loop. This yields a frequency-resolved interaction index and a structured singular-value sufficient certificate for determining when the stability conclusion of the IILT model can be certified with respect to a prescribed EMT uncertainty weight. The uncertainty weight can be obtained from detailed EMT models or terminal reference-response measurements. Case studies confirm that the proposed certificate correctly certifies model validity and identifies the loss of trustworthiness. We also demonstrate that the measurement-based uncertainty weights closely match the model-based ones, which enables deployment without accessing inner-loop models.
comment: 10 pages, 9 figures
Multidimensional Resilience for Electrical Power Systems: Systematic Review, Integrated Index, and Validation under Real-World Cyber-Physical Attack Scenarios
The accelerating decarbonization of energy systems has transformed electrical power systems into complex infrastructures exposed to threats whose interactions generate systemic vulnerabilities that conventional resilience approaches fail to capture. Although resilience assessment has expanded across multiple dimensions, existing studies largely examine them in isolation or adjacent pairs, leaving cross-dimensional couplings insufficiently explored. This study demonstrates i) that single-dimension assessments fail to capture the degradation produced by simultaneous cross-dimensional failures, ii) the nonlinear amplification emerging when physical, operational, and digital-cyber dimensions are jointly compromised, and iii) the intensification imposed by climatic and economic-regulatory stressors. To this end, we leverage a hybrid quantitative methodology. A PRISMA 2020 review with backward and forward snowballing identifies methodological gaps and unresolved dependencies across five resilience dimensions: physical, operational, digital-cyber, climatic-external, and economic-regulatory. Following this analysis, a Multidimensional Resilience Index (MDRI) is developed to capture endogenous couplings and exogenous amplification effects and is validated under escalating cyber-physical attack scenarios inspired by the December 2025 attack on Polish energy infrastructure. Results show that degradation under cascading and simultaneous failures is nearly eight times greater than under isolated stress, while exogenous conditions amplify degradation by an additional factor approaching six, with 72% of this amplification driven by exogenous stressors. Combined, these mechanisms produce a 46-fold increase in resilience loss compared to a single-vector reference.
comment: 35 pages, Elsevier Renewable and Sustainable Energy Reviews
Feedback Linearization and Control of a Grid-Forming Power Converter in an Islanded Microgrid
In an islanded setting, grid-forming inverters must regulate their terminal voltage without support from an external grid, even though the load current depends directly on that voltage. The usual approach is a cascaded proportional--integral (PI) controller, built on a fast inner current loop and a slower outer voltage loop, with feedforward terms used to compensate dq rotational coupling. However, this compensation is only exact at the operating point where the controller is tuned. This tutorial presents an alternative based on full-state feedback linearization. It is shown that the islanded inverter model has full relative degree, which allows exact state-space linearization with no internal or zero dynamics. A single feedback law cancels the main nonlinear effects; rotational coupling, resistive drops, and load conductance, so that the closed-loop system behaves like two independent double integrators. A standard pole-placement design is then used to shape the response. The controller is tested in MATLAB against a cascaded PI baseline under identical conditions at a 20 MW operating point, including reference tracking, load step disturbances, and parameter mismatch. The feedback-linearizing controller settles a reference step in 0.76 ms, while the PI controller does not reach the 2 % band within 50 ms. The cascaded PI controller shows better robustness to filter parameter mismatch due to its inner-loop integral action, which reduces steady-state errors under modeling uncertainty. Overall, the performance improvement and the robustness trade-off both come directly from the controller structures, rather than from tuning choices.
On Improved Statistical Accuracy of Low-Order Polynomial Chaos Approximations
Polynomial chaos expansions provide surrogate models for stochastic systems, with coefficients typically derived using Galerkin projection, stochastic collocation, or least squares approximation. These traditional approaches often fail to accurately capture statistical moments without resorting to high-order approximations. We propose a constrained optimization framework that modifies standard techniques to determine polynomial chaos coefficients that precisely recover the first two statistical moments. The effectiveness of our approach is demonstrated on several candidate algebraic functions of random variables, showing significant improvements in statistical accuracy even with low-order approximations.
Extremum Seeking Control Based Adaptive Compensation of Position Sensor Harmonics in PMSM Drives
Permanent Magnet Synchronous Machines (PMSMs) have become one of the preferred forms of electromechanical energy converters, attributing to their high efficiency, torque density, and other unique advantages. However, given the need for proper rotor position measurement for commutation and field orientation, accurate rotor position sensing is of paramount importance. In sensing motor rotor position with a sensor, harmonic errors that arise in the sensing subsystem lead to undesirable torque ripple. Thus, this paper presents an adaptive, extremum seeking control based approach capable of mitigating position signal harmonics in PMSMs. The proposed approach is experimentally validated under varying torque, speed, and harmonic conditions. Its harmonic compensation performance is comparatively evaluated against the look-up table based method. Furthermore, the accuracy of the proposed approach is analyzed, highlighting its effectiveness.
Qubit-Efficient Quantum Annealing for Stochastic Unit Commitment
Stochastic Unit Commitment (SUC) has been proposed to manage the uncertainties driven by renewable integration, but it leads to significant computational complexity. When accelerated by Benders Decomposition (BD), the master problem becomes binary integer programming, which is still NP-hard and computationally demanding for classical methods. Quantum Annealing (QA), known for efficiently solving Quadratic Unconstrained Binary Optimization (QUBO) problems, presents a potential solution. However, existing quantum algorithms rely on slack variables to handle linear binary inequality constraints, leading to increased qubit consumption and reduced computational efficiency. To solve the problem, this paper introduces the Powell-Hestenes-Rockafellar Augmented Lagrangian Multiplier (PHR-ALM) method to eliminate the need for slack variables, making qubit consumption independent of the increasing number of Benders cuts. To further reduce the qubit overhead, quantum ADMM is applied to break large-scale SUC into smaller blocks for sequential solutions, which does not scale with the number of generators. Finally, the simulation results on both 4-generator and the IEEE bus-118 systems demonstrate the feasibility and scalability of the proposed algorithm, indicating its superior qubit and runtime efficiency over classical and baseline quantum approaches on the D-Wave QPU platform.
TAO: Tolerance-Aware Optimistic Verification for Floating-Point Neural Networks
Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point(FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present TAO: a Tolerance Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. TAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement TAO as a PyTorch-compatible runtime and a contract layer currently deployed on Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers and diffusion models on A100, H100, RTX6000, RTX4090, empirical thresholds are $10^2-10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. Together, TAO reconciles scalability with verifiability for real-world heterogeneous ML compute.
comment: 18 pages, 8 figures
Heart Artifact Removal in Electrohysterography Measurements Using Algebraic Differentiators
Electrohysterography (EHG) enables non-invasive monitoring of uterine contractions but can be contaminated by electrocardiogram (ECG) artifacts. This work presents an ECG removal method using algebraic differentiators, a control-theoretic tool for model-free derivative estimation, that preserves signal shape outside the detected cardiac pulse locations. The differentiator parameters are designed to simultaneously suppress slow physiological artifacts and powerline interference. Cross-channel clustering distinguishes cardiac pulses from localized artifacts, enabling accurate pulse subtraction without auxiliary ECG references. Implemented as a causal FIR filter, the method is validated as a proof of concept on multichannel EHG recordings from one female and one male healthy volunteer and compared to the template subtraction method.
Decision-Focused Continual Learning for Seaport Power-Logistics Scheduling: Generalization across Varying Tasks
Power-logistics scheduling in modern seaports typically follows a predict-then-optimize pipeline. To enhance the decision quality of predictions, decision-focused learning has been proposed, which aligns the training of forecasting models with downstream decision outcomes. However, this end-to-end design inherently restricts the value of forecasting models to a specific task structure and therefore generalizes poorly to evolving tasks induced by varying vessel arrivals. We address this gap with a decision-focused continual learning framework that adapts online to a stream of scheduling tasks. Specifically, we introduce Fisher-information-based regularization to enhance cross-task generalization by preserving parameters critical to prior tasks. A differentiable convex surrogate is also developed to stabilize gradient backpropagation. The proposed approach enables learning a decision-aligned forecasting model across a varying task stream with sustainable long-term computational and memory requirements. Experiments calibrated to Jurong Port show improved decision performance and cross-task generalization over existing methods, together with reduced computational cost and a bounded memory footprint.
comment: Preprint to IEEE Transactions on Smart Grid
Revisiting Power System Stabilizers with Increased Inverter-Based Generation: A Case Study
As power systems evolve with increasing production from Inverter-Based Resources (IBRs), their underlying dynamics are undergoing significant changes that can jeopardize system operation, leading to poorly damped oscillations or small-signal rotor angle instability. In this work, we investigate whether Power System Stabilizer (PSS) setting adjustments can effectively restore system stability and provide adequate damping in systems with increased IBR penetration, using the benchmark Kundur Two-Area System as a case study. Specifically, we evaluate the model-based Residues and P-Vref PSS tuning methods to examine their effectiveness under evolving grid conditions. Our findings indicate that the effectiveness of these tuning methods is not guaranteed, particularly when coordination is limited. Consequently, our case study motivates local and adaptive online PSS tuning methods.
Grid-Forming Characterization in DC Microgrids
DC microgrids are converter-based electrical networks that are increasingly being used in various applications, including data centers and industrial distribution systems. A central challenge in their operation is maintaining the DC-bus voltage within predefined limits while ensuring overall system stability. Although a wide variety of converter control algorithms has been proposed to achieve these objectives, the literature lacks a clear and physically interpretable framework for evaluating their effectiveness and for classifying and comparing them. Moreover, the grid-forming versus grid-following distinction that exists in AC systems has largely been unexplored in DC microgrids. To address this gap, this paper introduces three novel impedance-based indices that can be used to quantify the voltage-forming and current-forming behavior of a converter. The indices also provide a basis for defining the desired converter behavior that yields superior DC-bus voltage regulation performance. Simulation results illustrate the application of the framework to several representative control strategies and highlight the strengths and limitations of these control algorithms.
Exploring Converter Control Duality in Microgrids: AC Grid-Forming vs DC Droop Control
Power electronic converters are fundamental building blocks of both AC and DC microgrids, enabling the integration of renewable energy sources, energy storage systems, electronic loads, and electric vehicles. In contrast, converter control in DC microgrids has developed along the path of droop control, which is widely adopted for decentralized DC-bus voltage regulation and power sharing. Although these control strategies share certain characteristics, their similarities remain largely unexplored due to the distinct physical domains in which they operate. To bridge this gap, we introduce a novel perspective based on the concept of duality to reveal the underlying isomorphism between the two control approaches. We show that AC grid-forming and DC I--V droop control are duals of each other in several aspects, including: (i) the small-signal model of the converter; (ii) the inner current control structure; (iii) power-sharing mechanisms based on the AC swing equation and DC capacitor power balance; and (iv) disturbance signals and dynamic response. Theoretical analysis, validated through simulations on simple converter setups, illustrates these dualities and provides new insights towards a unified control design.
Power System Robust State Estimation As a Layer: An Optimization-embedded End-to-end Learning Approach
Serving as an essential prerequisite for modern power system operation, robust state estimation (RSE) could effectively resist noises and outliers in measurements. The emerging neural network (NN) based end-to-end (E2E) learning framework enables real-time application of RSE but potentially yields solutions that are statistically accurate yet physically inconsistent. To bridge this gap, this work proposes a novel E2E learning based RSE framework, where the convex-relaxed RSE problem is innovatively constructed as an explicit differentiable layer into an NN as the first trial. This optimization-embedded layer (termed as `Opt-Layer` in our work) serves as a solver of the RSE problem. Then, the relaxed solutions are recovered through post-processing layers. Through seamlessly embedding the underlying KKT conditions into the gradients during backward propagation, the physical consistency in the estimated states could be significantly enhanced, realizing lower measurement residuals. Also, the measurement weights are treated as learnable parameters of NN to enhance estimation robustness, enabling the Opt-Layer to actively denoise. A hybrid loss function is formulated to pursue accurate and physically consistent solutions. Extensive simulations have been carried out to demonstrate that the proposed framework can significantly improve the SE performance especially in terms of physical consistency on eight test systems, in comparison to classical E2E learning models, physics-informed NN (PINN) models, graph-based learning models, and conventional optimization-based approaches. The estimation performances under partial observability, severe noise contamination are systematically evaluated. Computational complexity and runtime analysis are also comprehensively demonstrated.
comment: This work has been revised and resubmitted to the IEEE for possible publication
Robotics
Affordance-Based Hierarchical Reinforcement Learning for Quadruped Pedipulation
The object manipulation capabilities of quadruped robots is an open research challenge. While previous studies have focused on low-level policy learning, task execution still relies on expert-designed high-level trajectories. Autonomous selection of both an affordable interaction point on the target object and an affordable robot base pose removes the need for pre-designed trajectories. This study proposes a three-level hierarchical reinforcement learning (RL) framework that utilizes pose affordances to guide the navigation policy, while the navigation policy drives the locomotion policy. In addition, the pedipulation policy is guided by interaction-point affordances, enabling object-centric pose alignment of the quadruped robot and effective end-effector manipulation planning. We train the proposed framework in the IsaacSim ecosystem and evaluate it in both simulation and real-world settings. We investigate the effectiveness of pose affordance across multiple scenarios in simulation while various object interaction tasks are validated on real-world setting forming an object-interaction dataset. The results show that the proposed framework can autonomously identify candidate poses based on their affordance and successfully execute object manipulation tasks in the real world without human guidance.
comment: This paper is submitted to Wiley Journal of Field Robotics
Physiologically Constrained Musculoskeletal Neural Network for Multi-DoF Joint Kinematics Estimation from Partially Observed sEMG
This paper investigates multi-degrees of freedom (DoF) joint kinematics estimation under partially observed surface electromyography (sEMG), where only a subset of task-relevant muscles can be measured due to anatomical inaccessibility or sensor constraints. A novel musculoskeletal neural network (MSK-NN) is proposed to estimate multi-DoF joint angles while simultaneously inferring activations for both measured and unmeasured muscles. MSK-NN consists of a CNN-based muscle activation estimator and an embedded MSK forward dynamics module, forming a fully differentiable architecture. Unlike existing hybrid neural frameworks that require additional biomechanical labels (e.g., muscle-tendon forces, joint torques), MSK-NN is trained without direct supervision of internal biomechanical variables. A composite physics-physiology loss is designed by incorporating a joint kinematics loss, a data-driven muscle synergy loss, and an anatomy-guided trend loss. The proposed method is evaluated on two-DoF wrist kinematics estimation across three rhythmic motions with unconstrained speed and amplitude, and one random motion. Compared with CNN, Bi-LSTM, CNN-LSTM, and PET baselines, MSK-NN achieves lower normalized root mean square error (NRMSE) and higher coefficient of determination (R2), especially for the random motion. More importantly, the optimized MSK parameters remain within physiological limits, and the estimated activation of an input-excluded muscle exhibits strong temporal agreement with its recorded sEMG envelope, demonstrating the capability of musculoskeletal (MSK)-NN to recover physiologically plausible activations.
Planning-aligned Token Compression for Long-Context Autonomous Driving
Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on conditional VQ-VAE, compressing extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, planning with retained decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve $>$6% improvement (68.3%) on success rates with consistent gains across metrics. Ablations validate planning-aligned coupling effectiveness. Closed-loop evaluation confirms that COMPACT-VA maintained general driving performance with 3.3* speedup and 2.7* memory reduction over uncompressed processing.
comment: 9 pages
On orbital stabilization of a circular motion primitive for a dynamic extension of the Dubins car model
This paper addresses orbital stabilization of a circular motion primitive for a dynamic extension of the Dubins car model within a transverse-linearization framework. We show that the corresponding transverse linearization is unstable and not stabilizable by linear state feedback. Therefore, the standard linearization-based approach to orbital stabilization cannot be applied directly. The main contribution is a set of explicit and verifiable conditions that characterize when a controller design based on transverse linearization remains applicable. These conditions rely on the specific structure of the dynamics in a neighborhood of the motion and on the use of non-standard transverse coordinates for controller design and analysis. Numerical simulations illustrate the proposed design procedure.
comment: 34 pages
Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability
The ISO 26262 standard defines functional safety for road vehicles through risk assessments based on Severity, Exposure, and Controllability, grounded in a human-driven vehicle paradigm. In the context of autonomous vehicles (AVs), the absence of a human driver necessitates revisiting these principles. This paper decomposes the Controllability placeholder into two auditable evidence dimensions of ISO 26262 by introducing two measurable sub-concepts: Transferability and Predictability. Transferability extends Controllability to capture AV systems' ability to hand off control to dedicated fallback safety mechanisms, while Predictability captures how easily external agents can anticipate AV behavior. Predictability is formally defined from human-robot interaction-inspired principles, and a mathematical framework is provided to quantify it. A designed-versus-achievable gap is introduced to distinguish architectural fallback claims from scene-conditioned achievable fallback capability. The proposed metrics align with ISO 26262 and ISO/PAS 21448 (SOTIF), rendering fallback and interaction claims falsifiable and traceable across ODD slices. These dimensions complement rather than replace existing standards, and the enhancements preserve the structure of ISO 26262 while extending its applicability to driverless automated systems operating at SAE Levels 4 and 5.
Rapid co-design of Buoyancy-assisted robots for Challenging Locomotion using Gaussian Evolutionary Specialists
Designing high-performance legged robots requires jointly optimizing morphology and control. Model-free Reinforcement Learning (RL) offers an alternative to model-predictive control for developing robust controllers without explicitly specifying robot dynamics. Thus, we have seen theuse of RL to train controllers and evaluate designs for robot morphology optimization. While RL has shown success inlocomotion, using it in the co-design inner loop is expensive due to repeated policy training. Universal policies conditioned on morphology offer a promising alternative, but suffer from behavioral diversity collapse, converging to a single strategy that performs sub-optimally across designs. On the other hand, end-to-end Mixture-of-Experts (MoE) architectures fail due to a collapse in its representation. We propose Gaussian Evolutionary Specialists (GES), a framework that decouples design-space partitioning from policy learning to capture diverse behaviors explicitly. GES assigns specialist policies to evolving Gaussian regions and iteratively refines them via training, probing, and territory expansion. The resulting specialists are integrated into a design sampling loop, replacing costly re-training with direct evaluation. When tested on the Buoyancy-Assisted Light Legged Unit (BALLU), GES discovers designs with 5 - 25% higher performance than naive universal policies. On hardware, a GES optimized design overcomes a 24 cm tall obstacle - 3x improvement over the baseline BALLU design. Moreover, GES curtails design optimization time by 37%.
comment: Submitted to RA-L
Simulation-Driven Imitation Learning for Biosignals-Free Shared-Autonomy Prosthetic Grasping
Biosignals-free shared-autonomy control of upper-limb prosthetic hands aims to enable natural and low-effort manipulation without relying on EMG or other physiological signals. Recent imitation-learning-based approaches have shown promising results, but their scalability is limited by the cost and variability of collecting large amounts of real-world human demonstration data. In this work, we present a scalable simulation framework that automatically generates diverse reach-to-grasp demonstrations from a wrist-mounted virtual camera. The framework combines physically feasible grasp synthesis, natural reaching trajectories retargeting, and reach--grasp--lift execution in procedurally generated indoor environments. It records wrist-view observations, proprioception, and actions to build a large-scale demonstration dataset for imitation learning. Through extensive simulation benchmarks, we evaluate object and scene generalization and compare several representative state-of-the-art imitation learning methods. Results show that the simulated demonstrations are sufficiently rich and consistent for effective policy learning. In three realistic settings, the learned sim-to-real policy achieves over 90\% grasp success, surpasses baseline methods, and exhibits stronger generalization, highlighting the promise of simulation-driven training for biosignals-free shared-autonomy prosthetic grasping. The demonstrations are available at \href{https://sites.google.com/view/sim-prosthetic-grasp/home}{https://sites.google.com/view/sim-prosthetic-grasp/home}.
Spline Policy: A Structured Representation for Robot Policies
Modern imitation-learning policies for robot manipulation often represent actions as fixed-resolution action chunks, which are simple and effective but expose limited geometric and temporal structure before execution. This paper studies Spline Policy (SP), a structured representation that replaces action chunks with spline parameters while keeping the policy backbone unchanged. The predicted spline can be decoded as a compact continuous trajectory, queried at different temporal resolutions, constrained or edited in parameter space, and passed to downstream controllers. For quadratic spline outputs, the same representation can also be converted into a state-dependent vector field through an analytical distance-field construction. Under the regularity and projection assumptions of this construction, the induced dynamics do not increase the distance to the generated spline, yielding a principled local corrective mechanism around the predicted motion. The spline output further supports uncertainty propagation from observations to spline parameters, trajectories, and flow fields, and can be combined with classical control mechanisms such as null-space collision avoidance without retraining the policy backbone. We instantiate SP with diffusion, flow-matching, transformer-based, and vision-language-action backbones. Experiments in low-dimensional motion learning, simulated manipulation under matched backbones, dexterous manipulation, and real-robot case studies show that SP remains compatible with modern policy learners while exposing useful motion-structure properties, including compact decoding, temporal resampling, local correction around predicted motions, uncertainty evaluation, and controller compatibility.
comment: This work has been submitted to the IEEE for possible publication
RhinoVLA Technical Report
Vision-Language-Action (VLA) models have shown strong potential for robotic manipulation, but real-time deployment on edge hardware remains challenging. In this work, we identify VLM visual and context tokens as a major source of deployment latency: for GEMM-dominated projection operators, computation grows linearly with the number of input tokens when model dimensions are fixed. Motivated by this observation, we propose RhinoVLA, a deployment-oriented VLA model co-designed with the Huixi R1 edge SoC. RhinoVLA adopts a token-efficient Qwen3-VL backbone and a continuous Action Expert, reducing the VLM-side token and computation burden while preserving pretrained multimodal capability. To support cross-robot learning, RhinoVLA further introduces a unified interface that combines View Registry, 72D physical state-action slot space, and robotinstance LoRA, allowing heterogeneous robot observations and action schemas to be aligned under a shared policy. On the deployment side, RhinoVLA is optimized through hardware-aware compilation, mixed-precision execution, and parallel visual encoding. Experiments show that RhinoVLA achieves downstream performance comparable to π0.5 at a similar parameter scale, while reaching 11.69 Hz end-to-end inference on Huixi R1, meeting the 10 Hz real-time closedloop control target. The project will be open-sourced at https://github.com/HuixiAI/RhinoVLA.
Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos
Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies eachone against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.
CAPE: Contrastive Action-conditioned Parallel Encoding for Embodied Planning
Embodied agents need to predict the future consequences of candidate actions in order to plan effectively before execution. Existing visual dynamics models learn by reconstructing future visual states or rolling out dense latent representations, which spreads learning capacity across visually salient but planning-irrelevant content rather than the action-conditioned changes that drive manipulation outcomes. We propose CAPE, a Contrastive Action-conditioned Parallel Encoding framework that learns visual dynamics by distinguishing the future outcomes induced by different action sequences. Given an initial observation and a candidate action sequence, CAPE decodes the full future latent trajectory in a single forward pass and is trained with a Goal-Convergent Contrastive Objective that aligns predictions corresponding to the same future outcome while separating those corresponding to different outcomes. On real-world DROID and zero-shot transfer to RoboCasa, CAPE substantially outperforms prior baselines on future-state retrieval, offline action matching, and closed-loop planning, while notably reducing planning-time inference cost at long prediction horizons.
comment: 19 pages, 7 figures
Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approach\-es adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, and a navigator selects the best waypoint, with a low-level controller executing the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.
Does Appearance Help? A Systematic Study of Image-Based Re-Identification in Online 3D Multi-Pedestrian Tracking
LiDAR-based 3D Multi-Object Tracking (MOT) typically relies solely on geometric information, which is often insufficient to distinguish between targets during prolonged occlusions or in crowded human-populated environments. While integrating RGB-based Re-Identification (ReID) offers a theoretical solution for preserving identity context, existing approaches often rely on computationally expensive parallel detectors that hinder real-time robot responsiveness. This work presents a systematic study of image-based ReID in online 3D MOT, utilizing a lightweight projection-based framework to decouple geometric and appearance modeling for mobile robots. A comprehensive analysis of feature extraction architectures is conducted, employing lightweight CNNs and Vision Transformers, and evaluating various multi-modal data association strategies to balance computational latency with robust tracking. Experiments on the Pedestrian class of the KITTI dataset reveal that naive linear fusion, of appearance and motion costs, degrades performance due to visual noise. Conversely, a cascaded matching strategy successfully recovers occluded tracks without compromising overall precision, effectively preventing identity switches to maintain human-robot interaction continuity. We show that lightweight architectures can offer an optimal trade-off between the low latency required for safe navigation and the discriminative power needed for social awareness.
comment: Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)
Robotic Policy Adaptation via Weight-Space Meta-Learning
Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We propose WIZARD, a weight-space meta-learning framework that sidesteps task-specific fine-tuning by generating task-specific LoRA parameters for a frozen VLA policy. Given only a language instruction and a short demonstration video, WIZARD predicts the corresponding adaptation weights in a single forward pass, without target-task action labels or test-time optimization. During meta-training, WIZARD learns to map task evidence directly to expert LoRA updates, capturing relationships between tasks in weight space. Experiments on LIBERO show that WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda, WIZARD consistently improves over a real-domain adapted baseline, showing that generated adapters provide task-level specialization beyond simulation.
An Abstract Architecture for Explainable Autonomy in Hazardous Environments
Autonomous robotic systems are being proposed for use in hazardous environments, often to reduce the risks to human workers. In the immediate future, it is likely that human workers will continue to use and direct these autonomous robots, much like other computerised tools but with more sophisticated decision-making. Therefore, one important area on which to focus engineering effort is ensuring that these users trust the system. Recent literature suggests that explainability is closely related to how trustworthy a system is. Like safety and security properties, explainability should be designed into a system, instead of being added afterwards. This paper presents an abstract architecture that supports an autonomous system explaining its behaviour (explainable autonomy), providing a design template for implementing explainable autonomous systems. We present a worked example of how our architecture could be applied in the civil nuclear industry, where both workers and regulators need to trust the system's decision-making capabilities.
comment: Originally published 20th of October 2022 at the Second International Workshop on Requirements Engineering for Explainable Systems (RE4ES), which was hosted by the International Requirements Engineering Conference 2022
Shield-Loco: Shielding Locomotion Policies with Predictive Safety Filtering
Reinforcement learning (RL) policies enable dynamic legged locomotion but lack mechanisms to avoid violations of safety constraints that are absent during training. Large-scale offline safe learning is impractical for covering all edge cases. Existing safety frameworks either rely on reduced-order models that cannot reason about whole-body behaviors or require conservative recovery controllers that degrade task performance. We propose a predictive safety filter that post-hoc filters the nominal contact locations fed to the RL policy. When a collision is predicted, a sampling-based optimizer asynchronously searches for safer contact sequences using a full-physics model, while a learned value function bootstraps long-horizon returns. Our three algorithmic components (geometric projection of sampled contacts, momentum-augmented updates, and replica-exchange) make the optimization tractable in a discontinuous contact landscape. We validate the filter on a quadruped robot in dense, cluttered environments, both in simulation and in the real world, showing substantial reductions in safety violations with minimal deviation from the nominal input.
A Causal Probabilistic Framework for Perception-Informed Closed-Loop Simulation of Autonomous Driving
Software-in-the-loop (SIL) simulation is a cornerstone for the validation of modern automotive safety functions. However, many current frameworks utilize ideal sensing, which bypasses the functional insufficiencies of perception algorithms, leading to over-optimistic safety assessments. This paper proposes a perception-informed SIL testing methodology that bridges the gap between ground-truth simulation and real-world perception behavior. We present a framework for incorporating causal probabilistic models into standardized, scenario-based simulation toolchains, applicable to both Advanced Driver Assistance Systems (ADAS) and Autonomous Driving Systems (ADS). Our approach enables the systematic injection of realistic perception errors, such as loss of detection, sizing inaccuracies, and positioning offsets, derived from physical triggering conditions like fog, rain, and object-merging scenarios. By evaluating these ``faults'' within a standardized simulation environment, we demonstrate that perception-informed testing reveals latent operational risks that ideal SIL environments fail to capture, providing a scalable pathway for SOTIF (ISO 21448) validation.
Test-Time Trajectory Optimization for Autonomous Driving
End-to-end planners for autonomous driving typically generate a set of candidate trajectories, score each one, and return the highest-scoring candidate. However, the scorer is applied only after the proposals are generated and cannot influence the set of trajectories: a weak set of candidates limits planning performance regardless of the scorer's quality. We instead treat the scorer as a learned trajectory-level reward function and search for trajectories that maximize it. Our method, TOAD, runs the Cross-Entropy Method at test time, warm-started from the planner's proposals. It requires no retraining and is plug-and-play for existing planners. Across six base planners, TOAD improves results on NAVSIM-v1 (94.7 PDMS), NAVSIM-v2 (56.3 EPDMS), and the closed-loop HUGSIM benchmark. The code will be made publicly available via the project page: https://valeoai.github.io/TOAD/.
QuadVerse: An Integrated Framework Aligning Visual-Physical Reality for Quadruped Simulation
Simulation is central to robot learning, yet the sim-to-real gap remains a major bottleneck.Existing approaches often tackle visual or dynamic gaps separately, overlooking how these individual mismatches accumulate and propagate throughout the robot's state evolution.In this paper, we introduce QuadVerse, an integrated framework that uses reconstructed scenes as a calibration substrate for aligning visual perception, physical interaction, and actuator dynamics.From captured RGB videos, we reconstruct geometry-constrained 3D Gaussian Splatting (3DGS) scenes that support batched photorealistic ego-view rendering and collision-ready semantic mesh extraction. The meshes further enable contact calibration by initializing spatially varying friction priors and refining them through trajectory-based posterior search.To address remaining actuator discrepancies, QuadVerse trains a residual dynamics compensator by replaying real-world trajectories on the contact-calibrated terrain, reducing the entanglement between terrain-induced contact errors and actuator non-idealities.Experiments show that QuadVerse improves reconstruction quality and locomotion tracking over relevant baselines.Leveraging this foundation, we demonstrate robust zero-shot visual-navigation policy deployment without task-specific real-world rollouts.
Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models
Most vision-language-action (VLA) models map observations directly to actions without explicit intermediate planning, which limits performance on long-horizon tasks where early mistakes compound. We propose Coarse-to-Control, a plan-execute VLA that introduces planning natively in the action-token space. The key idea is to let the policy first predict a compact sequence of coarse action tokens that summarize the intended future trajectory, and then generate executable action tokens conditioned on this plan. Because both planning and execution share a unified discrete action vocabulary, the plan stays close to the control manifold and provides directly actionable guidance rather than an abstract hint that must be translated back to motor commands. Experiments on LIBERO, SimplerEnv-WidowX, and real-world manipulation tasks show that action-token planning consistently improves over direct action generation, with the largest gains on long-horizon multi-stage tasks.
LARA: Latent Action Representation Alignment for Vision-Language-Action Models
Visual-language action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAM) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAM and VLA are typically trained separately, leaving LAM ungrounded during VLA training and VLA models constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes LAM and VLA via representation alignment. This enables reciprocal benefits where LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving an average of ~10%, ~5%, and ~15% improvement over 3 simulation and 1 meticulously designed real-world robotic manipulation benchmarks.
Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning
World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heavily on video prediction as action priors and lack adaptive multimodal reasoning, limiting their effectiveness on long-horizon, complex tasks. We observe that WAMs require different multimodal reasoning modes under different execution contexts: textual reasoning is essential during task transitions to guide high-level action prediction, while visual reasoning is critical during fine-grained manipulation for precise control. Motivated by this observation, we propose \textbf{AdaWAM}, a world action model with adaptive multimodal reasoning abilities. AdaWAM integrates a lightweight dynamic router that autonomously triggers textual or visual reasoning as needed during task execution. Experiments on both simulated and real-world embodied tasks show that AdaWAM substantially improves inference efficiency while outperforming state-of-the-art embodied policies. Codes and demos are available at: https://adawam.github.io/.
Predictive Style Matching: Natural and Robust Humanoid Locomotion
Reinforcement learning has become the prevailing approach to humanoid locomotion control: policies transfer reliably from simulation to hardware and recover gracefully from disturbances. Motion quality, however, still lags behind: task-only rewards often converge to stiff, asymmetric gaits, while motion imitation methods improve appearance but become more sensitive to external disturbances because reference signals can oppose the transient poses needed to regain balance. We propose Predictive Style Matching, in which an offline predictor maps the robot's lower-body state history and velocity commands to interpretable upper-body joint and gait targets that shape the rewards during training. Because the targets are state-conditioned rather than time-indexed and the predictor is used only at training time, the deployed controller inherits the proprioceptive interface and inference cost of a task-only RL baseline. On the Unitree G1, in both simulation and hardware, PSM reduces upper-body style error by roughly an order of magnitude over task-only RL while preserving its fall-recovery rate, whereas the motion-imitation baseline attains the lowest style error but fails to recover from disturbances about five times as often.
Extending Responsibility-Sensitive Safety for the Assessment of Offloaded Autonomous Driving Services SC
Safety is a fundamental requirement in the development of autonomous driving (AD) systems. While function offloading has demonstrated significant benefits in terms of computational efficiency and energy consumption, its application to safety-critical AD functionality introduces new challenges. In particular, offloaded service compositions incur increased and variable response times due to wireless vehicle-to-everything (V2X) communication, which directly affects the vehicle's reaction time and thus its safety guarantees. In this paper, we address this challenge by extending the definitions of Responsibility-Sensitive Safety (RSS) to explicitly account for different response times of local and offloaded AD service compositions. Based on this extension, we propose an integration into function offloading, using the RSS safety constraints for offloading decision-making and fallback mechanisms. Offloaded service compositions are only permitted if the current traffic situation remains safe under the corresponding end-to-end response time. If this condition is violated, the system performs a controlled fallback to local execution. Furthermore, we introduce an enhanced fallback strategy that includes a warm-standby phase for offloaded services, enabling faster and safer transitions from offloaded to local services. The proposed approach is integrated into our AD stack and evaluated in both simulation and the real world. Experimental results demonstrate that the proposed method improves safety compared to state-of-the-art function offloading and safety frameworks, while preserving the benefits of distributed computation when safety conditions allow.
comment: 8 pages; accepted for 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC), Naples, Italy, September 15-18, 2026 - DOI will be added after publication
A Multi-Operator Mixed-Reality Interface for Multi-Robot Control and Coordination: Co-Located and Private Workspace Collaboration
Multi-operator control of robot teams requires not only access to the same mission information, but also mechanisms for maintaining shared awareness and preventing conflicting interventions. Building on our previous HORUS interface (Holistic Operational Reality for Unified Systems) we present a mixed-reality interface that extends single-operator multi-robot supervision to collaborative multi-operator use. The system supports two complementary modes: a co-located shared workspace, in which operators observe and manipulate the same mini-map in the same physical location, and a private-workspace mode, in which operators work on the same mission through independently placed local workspaces. The architecture combines registration-driven scene construction, lightweight shared-session synchronization, and per-robot control leases to support collaborative monitoring, tasking, and teleoperation while preventing conflicting commands. We evaluated the approach in a human-subject study with 36 participants (18 pairs) controlling three Nova Carter mobile robots in two search environments. The performance of the objective task was comparable across the two modes, indicating that both modes supported effective mission execution. However, the co-located shared workspace significantly improved perceived collaboration, shared understanding, and handoff clarity, and was the preferred collaborative mode. These results indicate that physically co-locating the MR workspace improves how operators coordinate even when the underlying robot-control tools remain unchanged.
comment: Submitted to RO-MAN 2026
Task Editing for Generalizable 3D Visuomotor Policy Learning
3D visuomotor policies offer a promising direction for complex robotic manipulation, as depth maps and point clouds provide rich geometric information for spatial reasoning. However, their success often depends on large-scale real-world demonstrations, which are costly and time-consuming to collect. To this end, existing methods commonly use demonstration generation strategies to improve data efficiency by applying object-centric transformations to human-collected demonstrations, such as varying object poses or scales. While effective for local variation, these transformations largely preserve the original scene structure and skill sequence, limiting their ability to synthesize diverse scene-skill-object combinations for complex tasks. In this paper, we propose Task-Edit, a novel demonstration generation framework that generates diverse trajectories from a task-centric editing perspective. The key insight of Task-Edit is to decompose a task into scene, skill and object components, and flexibly recombine them. In this way, Task-Edit enables scalable demonstration generation and significantly improves generalization for long-horizon manipulation tasks. We evaluate Task-Edit through extensive real-world experiments and demonstrate three advantages: (1) Effectiveness: Task-Edit significantly improves 3D visuomotor policies across various real-world tasks and robot embodiments. (2) Generalizability: Task-Edit improves model generalization across different scenario setups. (3) Applicability: Task-Edit enables models to handle scenarios that are difficult to collect in the real world, including disturbance resistance, obstacle avoidance and unseen cluttered scenes.
comment: 8 pages, 4 figures
Mission-Level Runtime Assurance Framework for Autonomous Driving
This paper studies runtime safety for autonomous driving when high-level driving commands become faulty or unreliable. Unlike conventional runtime-safety approaches that mainly focus on immediate vehicle safety, the proposed framework evaluates both driving safety and whether the vehicle can still successfully complete its mission before a command is executed. The framework extends highway-env with mission-level fault scenarios such as skipping required checkpoints, entering restricted areas, and generating future routes that can no longer complete the mission successfully. A runtime monitoring system is introduced to detect and reject unsafe or mission-infeasible commands before execution. For comparison, an adapted Simplex-Drive runtime-safety baseline with learning-based driving control, safety fallback control, and runtime controller switching is implemented using the public Simplex-Drive framework. Experimental results show that platform-level runtime safety alone cannot detect mission-level planning faults, while the proposed framework successfully rejects mission-infeasible commands and improves mission success under randomized fault conditions.
Compliance-Based Sensor Placement for Force Sensing on a Sensorized Prostate Phantom
This work presents a compliance-based sensor placement method for force sensing on a sensorized prostate phantom designed for Digital Rectal Examination training. The phantom combines three internal pneumatic chambers, used as intrinsic pressure sensors, with ten surface displacement markers. A finite-element simulation dataset is generated by applying external forces at sampled surface locations, from which a compliance matrix relating force inputs to pressure and displacement responses is constructed. Based on this matrix, we propose a weighted greedy selection strategy that maximizes local force reconstructability while prioritizing the clinically relevant posterior contact region and avoiding marker placement directly within the Region of Interest. Compared with a global QR-based placement strategy, the proposed method increases the mean reconstructability score in the target region by 22.5%. These results suggest that region-aware sparse sensor placement can improve force observability in soft robotic medical phantoms while maintaining a limited and practical sensing configuration.
LIMMT: Less is More for Motion Tracking ICML 2026
We argue that high-quality motion data can steer tracking policies toward better optimization trajectories early in training. In this work, we introduce LIMMT (Less Is More for Motion Tracking). To our knowledge, this is the first data-centric study for physics-based humanoid motion tracking. We go beyond simply removing low-quality and erroneous clips, but define motion data quality through three dimensions: physics feasibility, diversity, and complexity. We show that even training with under 3% of AMASS yields better tracking performance than training with the full dataset. We further conduct data cleaning on the estimated web-sourced mocap data. Extensive experiments and analyses validate the effectiveness of our framework.
comment: Accepted at ICML 2026
T-GMP: Terrain-conditioned Generative Motion Priors for Versatile and Natural Humanoid Locomotion
Achieving both anthropomorphic naturalness and robust terrain traversal remains a fundamental challenge in humanoid locomotion. Existing Reinforcement Learning (RL) approaches typically rely on fixed motion priors, limiting their adaptability to varying environments. We propose Terrain-conditioned Generative Motion Priors (T-GMP), a module that captures a terrain-conditioned latent motion manifold from a few expert state-terrain demonstrations using a Conditional Variational Autoencoder (CVAE). The learned priors enable smooth style transitions, facilitating a unified policy that adapts to terrain variations. We integrate T-GMP into an adversarial learning pipeline with our proposed Foothold Penalty, where a discriminator dynamically modulates naturalness constraints conditioned on local terrain features, guiding the generation of versatile and human-like motions. Experimental results demonstrate that our method outperforms existing baselines in traversal success rate and motion smoothness, while preserving biomimetically natural and physically coordinated motions.
ActionMap: Robot Policy Learning via Voxel Action Heatmap
Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: https://github.com/showlab/ActionMap.
A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation
In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enhance cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at https://github.com/KJZhuAutomatic/Cross-view-Grasp.
comment: Corresponding author: Jin Xie
Neuro-Symbolic Learning for Long-Horizon Task Planning Under Complex Logical Constraints
Task planning often suffers from severe efficiency bottlenecks when robots must reason over long-horizon action sequences under complex logical constraints, including object affordances, spatial relationships, and sequential action dependencies. Recent neuro-symbolic methods improve planning efficiency by learning object-importance scores to prune task-irrelevant objects, but they typically rely on fixed offline supervision generated from full search spaces. This creates a train-test mismatch: at deployment, the planner operates in pruned search spaces induced by the model's own imperfect predictions, leading to exposure bias and degraded planning performance. To address this challenge, we formulate object-importance learning for task planning as an imperative learning-based bilevel optimization problem. The upper level optimizes a neural scorer, while the lower level solves a symbolic planning problem in the score-pruned search space. To stabilize this learning process, we introduce a 3R strategy into the lower-level planning, using parallel Repair, Restart, and Rollback recovery to provide reliable and adaptive feedback for upper-level learning. Experiments on three challenging benchmarks demonstrate state-of-the-art performance, including an 80.04% reduction in failure rate and a 57.14% reduction in planning time. We further validate the framework on a quadruped-based mobile manipulator in simulation and the real world, demonstrating its potential for efficient and deployable neuro-symbolic task planning.
What Is My Robot Thinking? Design Considerations for Transparent and Trustworthy Shared Autonomy IROS 2026
Assistive robots operating under shared autonomy must balance user control with autonomous assistance. Because robot actions depend on internal intent inference that is not directly observable, mismatches between inferred and intended goals can undermine coordination and trust. We investigate how interface-level transparency, including feedback modality (visual vs. auditory) and information richness (sparse vs. rich), shapes interaction in a vision-based shared autonomy system. In a user study with N=25 participants across two assistive manipulation tasks, we evaluate how these designs influence coordination and trust. Providing feedback significantly improves intent alignment and reduces corrective intervention, indicating that making the inferred goal legible accelerates convergence in shared control. Participants preferred visual over auditory feedback, while preferences for sparse versus rich information depended on task complexity. We also found that revealing the full belief distribution did not consistently improve alignment or trust. Together, these findings indicate that effective transparency enhances coordination primarily through goal legibility, while trust depends on task-appropriate information exposure rather than maximal disclosure. Based on these results, we outline guidelines for designing transparent shared autonomy systems.
comment: 9 pages, 5 Figures, Code and videos are available at https://sites.google.com/view/design-t2-sa/home. Under review at IROS 2026
Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation
Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce \textbf{FLIGHT}, a \textbf{F}ine-grained \textbf{L}ong-horizon \textbf{I}nstruction-\textbf{G}uided benchmark for \textbf{H}ybrid UAV navigation and reasoning \textbf{T}asks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose \textbf{FLIGHT VLA}, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit \textbf{Pilot Reasoning} texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.
STRIPS-WM: Learning Grounded Propositional STRIPS-style World Models from Images
Robots performing long-horizon visual manipulation observe high-dimensional images, but successful plans depend on action-relevant facts: what can be done now and what changes afterward. A useful planning representation should discard irrelevant visual details while preserving action applicability and effects. Classical task planners exploit this structure through symbolic operators with preconditions and effects, but obtaining such representations from raw visual experience remains challenging. We study a visual task-planning setting in which a robot receives only image transitions: the current image, executed high-level action, and the resulting image. At test time, given a start image and a goal image, the robot must produce a sequence of high-level actions that reaches the goal. To address this problem, we introduce STRIPS-WM, a framework for learning image-grounded STRIPS-style world models directly from visual transitions. STRIPS-WM first induces a finite abstract transition graph from images, then learns latent binary predicates and one grounded propositional operator per action label. The learned operators form a symbolic action model with sparse preconditions and add/delete effects. Finally, the learned predicates are distilled into a visual encoder, enabling classical planning directly from novel start and goal images. Experiments on visual rearrangement tasks show that STRIPS-WM improves image-to-plan success over the tested visual rollout, latent graph-search and latent-symbolic baselines.
Three-dimensional hydro-cluttered locomotion by an undulatory robot
Aquatic robots have expanded human access to underwater environments, yet many underwater spaces contain obstacles that can disrupt open-water locomotion. In "hydro-cluttered" environments, water is interspersed with rigid and flexible clutter, making body-obstacle contact unavoidable. Operating in these spaces requires robots that can regulate and exploit contact, but this regime remains difficult to model or simulate. Building on recent advances in mechanical intelligence in terradynamically capable limbless robotics, we develop principles for 3D aquatic locomotion using AquaMILR, an elongate limbless robot that combines bilateral cable-driven actuation, programmable body compliance, distributed depth regulation, corrosion-resistant enclosures, and onboard power and electronics for untethered field operation. Systematic robophysical experiments reveal that programmable body compliance regulates body deformation and converts body-environment interactions into fast, robust, forward progression across increasing hydro-clutter constraint strength. Depth regulation provides three-dimensional access, allowing the robot to bypass clutter, recover from obstruction, and continue through otherwise inaccessible routes. In potential jamming scenarios, emergent inertia-induced rolling acts as a spontaneous recovery mechanism, freeing the robot from clutter that would otherwise lead to failure and allowing locomotion to continue without additional control. Tests of the robot in an aquatic mangrove field demonstrate that these principles transfer to practical operation, enabling navigation and onboard visual inspection of inaccessible root zones. These results establish principles for hydro-cluttered locomotion and a design paradigm in which aquatic robots exploit environmental complexity as a locomotor resource.
Lane Change Trajectory Planning for Personalized Driving Comfort and Mobility Efficiency
Lane changing entails simultaneous longitudinal and lateral motions that affect driving comfort and mobility efficiency. Because these motions are tightly coupled and subject to substantial inter-vehicle variability, trajectory planning for lane-change maneuvers is characterized by a highly personalized nature. This study proposes a neural network-driven planner that integrates a third-order polynomial trajectory generator with a learning module that infers optimal trajectory parameters across diverse driving conditions. Using a shared backbone with dual heads, one head ensures all-condition operational guarantees, while the other captures driver-specific preferences for comfort or mobility efficiency. A head-gated switching mechanism, realized through a statistical gate based on error-winner logistic regression, adaptively selects the appropriate head under varying driving conditions, which enables context-aware lane-change trajectory planning. Representative cases and Monte Carlo simulations show that the proposed planner achieves personalized comfort and mobility during lane changes, while the baseline ensures feasible trajectories under driving conditions where personalized data are insufficient or inaccessible.
comment: Accepted by the IEEE Intelligent Vehicles Symposium (IEEE IV 2026), Detroit, MI, United States, June 22_25, 2026
Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension
This paper presents ERNEST, a four-wheeled planetary rover concept equipped with a two-degree-of-freedom Active Gimbal Suspension that combines yaw and roll actuation to enable wheel reconfiguration, steering, and active load redistribution. A single neural network controller, trained to track a desired path across challenging terrain, fully unlocks the capabilities of this actuated suspension system for autonomous obstacle negotiation. A reinforcement learning framework is developed using the high-fidelity DARTS simulation engine, which combines rigid-contact dynamics and Bekker-Wong terramechanics, enabling the emergence of locomotion strategies adapted to loose-soil conditions. To obtain a single unified controller across heterogeneous terrains, a policy consolidation strategy merges the experience of terrain-specialized agents into one neural network, eliminating the need for explicit terrain classification and controller switching. The resulting controller operates on a combination of proprioceptive and exteroceptive feedback, including sparse stereo-derived terrain elevation, chassis attitude, joint states, and force-torque measurements. Zero-shot transfer to the physical rover is achieved through domain randomization, sensor noise injection, and model-to-real system identification. Experimental results demonstrate autonomous traversal of rock fields, a bump trap, a wheel-high step, sand ripples, and sandy slopes. On a 20° sandy slope, the learned controller reduces the cost of transport by 37% on dry sand despite the additional actuation, and achieves superior performance on wet sand where the passive suspension becomes completely immobilized.
comment: 21 pages, 26 figures
End-to-End Control of a Powered Knee-Ankle Prosthesis Towards Unified, Tuning-Free Assistance
Powered prostheses conventionally rely on impedance controllers that require extensive manual tuning and explicit mode classification. In this work, we present real-time deployment of an end-to-end prosthesis controller that estimates continuous actuator signals from onboard sensors, eliminating the need for intent classifiers and subject-specific tuning. Temporal Convolutional Networks were trained on a multi-terrain dataset from 18 individuals with transfemoral amputation and deployed in real time across five locomotion modes. Four participants (three able-bodied, one with transfemoral amputation) ambulated across level ground, ramp ascent and descent, and stair ascent and descent. During level walking, the deployed controller reproduced the training-data scaling of peak ankle torque with walking speed (deployed 0.85 Nm/kg per m/s, p = 0.001; training 0.96 Nm/kg per m/s, 95% CI [0.42, 1.50], p = 0.002), after excluding one outlier traced to atypical prosthesis loading. During ramp ascent, the controller scaled knee pre-flexion with grade (deployed 2.92 deg/deg, p = 0.027; training 3.30 deg/deg, 95% CI [1.83, 4.77], p < 0.001). During ramp descent, the controller increased resistive knee torque relative to level walking (deployed +0.16 Nm/kg, p < 0.001; training +0.16 Nm/kg, p = 0.008). Seamless stair transitions were generated for both intact- and prosthetic-side-leading sequences in ascent and descent, despite the training data containing only one limb-leading sequence. These results provide initial evidence towards end-to-end control that can provide unified, mode-adaptive prosthetic assistance without subject-specific tuning.
comment: 7 pages, 6 figures
TBD-VLA: Temporal Block Diffusion Vision Language Action Model
Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
Predicting Dynamic Map States from Limited Field-of-View Sensor Data
When autonomous systems are deployed in real-world scenarios, sensors are often subject to limited field-of-view (FOV) constraints, either naturally through system design, or through unexpected occlusions or sensor failures. In conditions where a large FOV is unavailable, it is important to be able to infer information about the environment and predict the state of nearby surroundings based on available data to maintain safe and accurate operation. In this work, we explore the effectiveness of deep learning for dynamic map state prediction based on limited FOV time series data. We show that by representing dynamic sensor data in a simple single-image format that captures both spatial and temporal information, we can effectively use a wide variety of existing image-to-image learning models to predict map states with high accuracy in a diverse set of sensing scenarios.
comment: 6 pages, 4 figures. Accepted to the 2026 International Conference on Advanced Visual and Signal-Based Systems (AVSS)
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently across time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across both simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at https://gem-4d.github.io/.
comment: Robotic World Model, Video Generative Model
SERNF: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows
Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment. We present SERFN, a sample-efficient off-policy fine-tuning framework with normalizing flow (NF) to address these challenges. The normalizing flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SERFN on two challenging dexterous manipulation tasks in the real world: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp -- both of which require precise, dexterous control over long horizons. On these tasks, SERFN achieves stable, sample-efficient adaptation where standard methods struggle.
comment: https://srl-ethz.github.io/SERNF/
A Human-Sensitive Controller: Adapting to Human Musculoskeletal Disorder-Related Constraints via Reinforcement Learning
Work-Related Musculoskeletal Disorders continue to be a major challenge in industrial environments, leading to reduced workforce participation, increased healthcare costs, and long-term disability. This study introduces a human-sensitive robotic system aimed at reintegrating individuals with a history of musculoskeletal disorders into standard job roles, while simultaneously optimizing ergonomic conditions for the broader workforce. This research leverages reinforcement learning (RL) to develop a human-aware control strategy for collaborative robots, focusing on optimizing ergonomic conditions and preventing pain during task execution. Two RL approaches, Q-Learning and Deep Q-Network (DQN), were implemented and tested to personalize control strategies based on individual user characteristics. Although experimental results revealed a simulation-to-real gap, a fine-tuning phase successfully adapted the policies to real-world conditions. DQN outperformed Q-Learning by completing tasks faster while maintaining zero pain risk and safe ergonomic levels, achieving on average 38% shorter task completion times across all tested anthropometries. The structured testing protocol confirmed the system's adaptability to diverse human anthropometries, underscoring the potential of RL-driven cobots to enable safer, more inclusive workplaces.
CRAFT: Coaching Reinforcement Learning Autonomously using Foundation Models for Multi-Robot Coordination Tasks
Multi-Agent Reinforcement Learning (MARL) provides a powerful framework for learning coordination in multi-agent systems. However, applying MARL to robotics remains challenging due to their high-dimensional continuous joint action spaces, complex reward design, and non-stationarity from concurrently learning agents. On the other hand, humans often learn complex coordination with the help of coaches, who guide learning through carefully designed curricula and detailed feedback. Building on the reasoning capabilities of foundation models, we argue that these models can similarly coach robots to learn coordination. Motivated by this, we propose CRAFT: Coaching Reinforcement learning Autonomously using Foundation models for learning coordination Tasks, a framework that leverages foundation models to act as a "coach" for multi-robot coordination. CRAFT automatically decomposes long-horizon coordination tasks into sequences of subtasks using the planning capability of Large Language Models (LLMs). Then, CRAFT trains each subtask using LLM-generated reward functions, and refines them through a Vision Language Model (VLM)-guided reward-refinement loop. We evaluate CRAFT on multi-quadruped navigation and bimanual manipulation tasks, and demonstrate its capability to learn complex coordination behaviors. In addition, in a multi-quadruped navigation setting, we show that our learned policies transfer to the real world. Project website is https://iconlab.negarmehr.com/CRAFT/
Efficient Coordination and Synchronization of Multi-Robot Systems Under Recurring Linear Temporal Logic
We consider multi-robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottom-up approach combining offline plan synthesis with online coordination, dynamically adjusting plans via real-time communication. To address action delays, we introduce a synchronization mechanism ensuring coordinated task execution, leading to a multi-agent coordination and synchronization framework that is adaptable to a wide range of multi-robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings through lab experiments involving nine robots showing enhanced adaptability compared to previous methods. Additionally, we conduct simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability features of our work.
From Pixels to Shelf: An Integrated Robotic System for Autonomous Supermarket Stocking with a Mobile Manipulator
Autonomous stocking in retail environments, particularly supermarkets, presents challenges due to dynamic human interactions, constrained spaces, and diverse product geometries. This paper introduces an efficient modular robotic system for autonomous shelf stocking, integrating commercially available hardware with a scalable algorithmic architecture. A major contribution of this work is the system integration of off-the-shelf hardware and ROS2-based perception, planning, and control into a single deployable platform for retail environments. Our solution leverages Behavior Trees (BTs) for task planning, fine-tuned vision models for object detection, and a two-step Model Predictive Control (MPC) framework for precise shelf navigation using ArUco markers. Laboratory experiments replicating realistic supermarket conditions demonstrate reliable performance, achieving over 98% success in pick-and-place operations across a total of more than 700 stocking events. However, our comparative benchmarks indicate that the performance and cost-effectiveness of current autonomous systems remain inferior to that of human workers, which we use to highlight key improvement areas and quantify the progress still required before widespread commercial deployment can realistically be achieved.
comment: Preprint for CASE 2026
Latent Geometry Beyond Search: Amortizing Planning in World Models
Modern vision-based world models can represent observations as compact yet expressive latent manifolds, but fast goal-oriented planning in these spaces remains challenging. This raises a central question: when does a learned representation simplify control, rather than merely enabling prediction? We study this question in a pretrained LeWorldModel, whose latent geometry is regularized for smoothness and uniformity. Our key insight is that, under such geometry, planning can be amortized into a latent inverse-dynamics mapping instead of requiring online search. We therefore replace iterative planning with a lightweight Goal-Conditioned Inverse Dynamics Model (GC-IDM) that maps the current latent state, goal latent state, and remaining horizon directly to the next action. Empirically, across four benchmark environments spanning navigation, contact-rich manipulation, and continuous control, our controller matches or exceeds CEM in seven of eight environment-protocol settings while reducing per-decision cost by 100-130x. A broader sweep over test-time planners (CEM, MPPI, iCEM, and gradient-based methods) shows that this result is not specific to a particular optimizer. These findings suggest that much of the structure recovered by test-time planning is already locally encoded in the latent representation. More broadly, our results indicate that sufficiently structured latent spaces can shift part of the planning burden from online optimization to learned inference. Our code is publicly available at https://github.com/hdnndh/Latent-Geometry-Beyond-Search-Amortizing-Planning-in-World-Models .
comment: 31 pages
MatterDoor: Sampling Zero-shot Spatio-semantic Priors using Generative Models
Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task-relevant semantics needed for safe navigation and goal-directed action. We ask whether off-the-shelf pretrained generative vision models can derive this missing structure as zero-shot offline priors for robot reasoning. Such priors should support spatio-semantic queries over unobserved structure, estimating the target object likelihood in hidden regions and the probability that those regions are occupied. Given an egocentric RGB observation and target query, our pipeline uses VLM-guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D-derived benchmark of doorway-occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object-reaching tasks. Our results suggest that useful spatio-semantic priors for planning can be derived without problem-specific fine-tuning.
comment: Under Review
Feasible Action Space Reduction for Quantifying Causal Responsibility in Continuous Spatial Interactions
Understanding the causal influence of one agent on another agent is crucial for safely deploying artificially intelligent systems such as automated vehicles and mobile robots into human-inhabited environments. Existing models of causal responsibility deal with simplified abstractions of scenarios with discrete actions, thus, limiting real-world use when understanding responsibility in spatial interactions. Based on the assumption that spatially interacting agents are embedded in a scene and must follow an action at each instant, Feasible Action-Space Reduction (FeAR) was proposed as a metric for causal responsibility in a grid-world setting with discrete actions.Since real-world interactions involve continuous action spaces, this paper proposes a formulation of the FeAR metric for measuring causal responsibility in space-continuous interactions. We illustrate the utility of the metric in prototypical space-sharing conflicts, and showcase its applications for analysing backward-looking responsibility and in estimating forward-looking responsibility to guide agent decision making. Our results highlight the potential of the FeAR metric for designing and engineering artificial agents, as well as for assessing the responsibility of agents around humans.
comment: In review
Chameleon: Control-Indexed Prospective Memory for Visuomotor Manipulation
Robots often observe information that determines a future action long before that action is executed. In a shell game, for example, a robot first sees which cup hides the ball, watches the cups move, and only later needs to choose the correct cup. The final observation alone is not enough for a decision: the correct action depends on an earlier event. We refer to this temporal gap as observation-action delay. It makes memory a policy-facing problem: a policy must keep similar histories distinct, retrieve the past event relevant to the current decision, and convert that recall into an action-ready state. We call these requirements separability, addressability, and prospectiveness. We introduce Chameleon, a ~60M visuomotor policy for control-indexed prospective memory. Chameleon writes embodied event memory, preserves separable histories, retrieves control-relevant traces, and trains the resulting working state to be prospective. We also introduce Camo-Dataset, a real-robot benchmark that isolates observation-action delay by making the decision scene visually ambiguous, so the correct action must be inferred from earlier observations. Chameleon improves decision/end-to-end success on Camo-Dataset from 22.5%/21.3% to 80.8%/71.3%. On public long-horizon memory benchmarks, it achieves 87.1% +/- 0.8% on LIBERO-10, 97.3% +/- 4.5% on MemoryBench, and 75.1% +/- 1.4% on MIKASA-Robo, setting the state of the art for same-size models and exceeding multiple larger VLA baselines under the reported protocols. Probes and ablations show that Chameleon learns separable, addressable, and prospective memory, and that these properties drive its performance gains.
comment: Code is available at https://github.com/gxyes/MARS_Chameleon
PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation
Bird's-eye-view (BEV) images have been widely demonstrated to provide valuable prior information for navigation. Given the global information provided by such views, two key challenges remain: how to fully exploit this information and how to reliably use it during execution. In this paper, we propose a navigation system that uses BEV images as global priors and is designed for ground and near-ground robotic platforms. The system employs an image generation model to interpret human intent from natural language, identify the target destination, and generate traversability masks. During execution, we introduce cross-view localization to align the robot's odometry with the BEV map and mitigate long-term drift in conventional odometry. We conduct extensive benchmark experiments to evaluate the proposed method and further validate it on a UAV platform. Using only a conventional local motion planner, the UAV successfully completes a 160-meter outdoor long-range navigation task. This work demonstrates how the world-understanding capabilities of foundation models can be transferred to embodied navigation, enabling robots to benefit from the strong generalization ability of existing image generation models.
comment: Work in the progress. 16 pages, 13 figures
Trust, Geometry, and Rules: A Credibility-Aware Reinforcement Learning Framework for Safe USV Navigation under Uncertainty
Autonomous navigation of Unmanned Surface Vehicles (USVs) that is safe and compliant with the International Regulations for Preventing Collisions at Sea (COLREGs) remains a formidable challenge in dynamic maritime environments, particularly when perception systems exhibit miscalibrated uncertainty. Existing Reinforcement Learning (RL)-based methods often falter because state-estimation errors induce unreliable belief states that mislead the value function, while discrete traffic rules introduce discontinuity in the learning objective. To address these challenges, we propose a framework integrating credibility-aware learning, geometric safety shielding, and continuous rule-aware embedding. First, Credibility-Weighted Value Learning (CW-VL) introduces a dynamic trust factor derived from the discrepancy between filter-estimated covariance and empirical error statistics to modulate the critic's heteroscedastic loss, preventing policy overfitting to noisy samples. Second, the Covariance-Inflated Velocity Obstacle (CI-VO) maps position-estimation uncertainty into set-wise angular margins, forming a conservative geometric shield that overrides hazardous exploratory actions. Third, Risk-Aware COLREGs Duty Embedding relaxes binary encounter duties into continuous rule-aware signals, providing smooth sector-transition information and suppressing oscillation from sparse rule rewards. Simulated encounter studies demonstrate improved training robustness against perceptual inconsistency and superior collision avoidance and COLREGs compliance over baselines.
HORUS: A Mixed Reality Interface for Managing Teams of Mobile Robots
Mixed Reality (MR) interfaces have been extensively explored for controlling mobile robots, but there is limited research on their application to managing teams of robots. This paper presents HORUS: Holistic Operational Reality for Unified Systems, a Mixed Reality interface offering a comprehensive set of tools for managing multiple mobile robots simultaneously. HORUS enables operators to monitor individual robot statuses, visualize sensor data projected in real time, and assign tasks to single robots, subsets of the team, or the entire group, all from a Mini-Map (Ground Station). The interface also provides different teleoperation modes: a mini-map mode that allows teleoperation while observing the robot model and its transform on the mini-map, and a semi-immersive mode that offers a flat, screen-like view in either single or stereo view (3D). We conducted a user study in which participants used HORUS to manage a team of mobile robots tasked with finding clues in an environment, simulating search and rescue tasks. This study compared HORUS's full-team management capabilities with individual robot teleoperation. The experiments validated the versatility and effectiveness of HORUS in multi-robot coordination, demonstrating its potential to advance human-robot collaboration in dynamic, team-based environments.
comment: 7 pages, 7 figures, conference paper submitted to UR 2026
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics and physical interactions, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator to jointly predict future proprioception and a scalar value. By grounding value estimation in anticipated embodiment dynamics, ViVa leverages spatiotemporal priors to intrinsically couple value with foresight beyond static snapshots. ViVa achieves state-of-the-art results in metric-based evaluation across three tasks, producing reliable value signals that accurately track task progress and detect execution errors. Integrated into RECAP, it achieves an average success rate of 80%, highlighting the promise of video-generative models for value estimation.
Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs
Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.
Where to Touch, How to Contact: Hierarchical RL-MPC Framework for Geometry-Aware Long-Horizon Dexterous Manipulation
A key challenge in contact-rich dexterous manipulation is the need to jointly reason over global geometry and nonsmooth contact dynamics. End-to-end policies bypass this complexity, but often require large amounts of data and transfer poorly from simulation to reality. We address the limitations with a simple insight: dexterous manipulation is inherently hierarchical--at a high level, a robot decides where to touch (geometry); at a low level it determines how to move the object through contact dynamics. Building on this insight, we propose a hierarchical RL--MPC framework in which a high-level reinforcement learning (RL) policy predicts a contact intention, a novel object-centric interface that specifies (i) an object-surface contact location and (ii) a post-contact object subgoal pose. Conditioned on the contact intention, a low-level contact-implicit model predictive control (MPC) optimizes local contact modes and real-time (re)plans through contact dynamics to generate robot actions that robustly move the object toward each subgoal. We evaluate the framework on non-prehensile tasks, including geometry-generalized pushing across diverse object shapes, pivoting/flipping-based object reorientation, and environment-assisted object repositioning. It achieves high success rate with substantially reduced data (10 times less than end-to-end baselines), highly robust performance, and zero-shot sim-to-real transfer.
Error-State LQR Formulation for Quadrotor UAV Trajectory Tracking
This article presents an error-state Linear Quadratic Regulator (LQR) formulation for robust trajectory tracking in quadrotor Unmanned Aerial Vehicles (UAVs). The proposed approach leverages error-state dynamics and employs exponential coordinates to represent orientation errors, enabling a linearized system representation for real-time control. The control strategy integrates an LQR-based full-state feedback controller for trajectory tracking, combined with a cascaded bodyrate controller to handle actuator dynamics. Detailed derivations of the error-state dynamics, the linearization process, and the controller design are provided, highlighting the applicability of the method for precise and stable quadrotor control in dynamic environments.
Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX
High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.
comment: Fix minor metadata mistakes
Multiagent Systems
Modelling Opinion Dynamics at Scale with Deep MARL
Modelling opinion dynamics typically relies on hand-crafted local interaction rules to study emergent macroscopic phenomena such as consensus and polarisation. In contrast, multi-agent reinforcement learning (MARL) enables agents to learn such behaviours directly by optimising simple rewards. To explore the potential of MARL for opinion dynamics, we introduce a GPU-accelerated consensus and truth-finding game that scales to populations of up to 1000 agents, comparable to many real-world social sub-networks. To prevent unrealistic conventions, we extend other-play to general-sum social interactions. We next validate our model on a subset of the Bluesky network by recovering agent importance structures from graph topology alone via a learned attention layer, finding that highly conforming populations most closely match human data. In large social media networks such high levels of conformity significantly reduce collective accuracy and promote dishonest agents that lie to fit in. By contrast, small, dynamic hunter-gatherer networks are less affected; here, conformity can even improve collective agreement. This suggests a mismatch between evolved human conformity heuristics and modern social media environments as a potential contributor to misinformation.
comment: 35 pages, 28 figures, preprint
Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration
Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict; classical Byzantine fault tolerance hides it behind byte-identity that LLM proposals do not satisfy. We introduce Hierarchical Certified Semantic Commitment (H-CSC), a BFT-inspired protocol that converts embedding-derived finality signals over verdict-conditioned proposal groups into one of three typed outcomes: a semantic_commit (a 2f+1 within-verdict semantic core backs the verdict, emitting a parameter-bound digest over the quantised aggregate), a verdict_commit (strong verdict margin but dispersed semantic rationale, emitting a verdict-level certificate without claiming a semantic aggregate), or an explicit abort with a typed reason. The contribution is typed finality, not raw commit accuracy. On a controlled semantic-poisoning diagnostic (BCS_v1, 120 episodes), H-CSC commits with low angular deviation on BFT-feasible buckets (0.31 to 2.04 degrees) and aborts 100% of beyond-BFT rounds (n<3f+1) as intended. On a real LLM-agent claim-verification benchmark (MVR-50, 50 tasks) under paired static and rushing Byzantine attacks, H-CSC commits 0.90/0.92 with honest-reference-invalid rates of 0.02/0.00, statistically matching a strong certificate-emitting verdict-only baseline. Unlike that baseline, H-CSC also emits an embedding-backed semantic_commit digest on 74%/72% of rounds, supplying typed provenance. A strict-semantic ablation commits only 0.54/0.48, showing the verdict-level fallback is necessary for coverage (+0.36/+0.44) at the same <=0.04 safety floor; a 100-task cross-model check across four LLMs preserves invalid_hmaj within 0.00 to 0.03.
comment: 27 pages, 3 figures, 8 tables
Learning Multi-Agent Communication Protocol: Study on Information Entropy Efficiency in MARL
Multi-Agent Systems (MAS) have emerged as a fundamental paradigm for distributed problem-solving, where autonomous agents collaborate to achieve complex objectives. Within this framework, Multi-Agent Reinforcement Learning (MARL) with communication has demonstrated remarkable success in cooperative tasks. However, existing approaches predominantly pursue performance gains through increasingly complex architectures and expanding communication overhead, lacking principled metrics to evaluate the efficiency of information exchange. In this paper, we focus on enabling agents to learn efficient multi-agent communication protocols that balance performance and information compactness. We propose the Information Entropy Efficiency Index (IEI), a novel metric that quantifies the ratio between message entropy and task performance in learned communication protocols. A lower IEI indicates more compact and efficient message representations. By incorporating IEI into training loss functions, we encourage agents to develop communication protocols that achieve high performance with improved communication efficiency. Extensive experiments across diverse MARL algorithms demonstrate that our approach achieves equivalent or superior task performance compared to baseline methods while improving communication efficiency. These findings challenge the prevailing assumption that performance improvements require complex architectures or increased communication overhead and highlight the potential of improving both task success and communication efficiency to enable scalable MAS.
From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability
Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another, but assume address-based transport over HTTP(S). Such transports protect message content, increasingly with end-to-end encryption. What they leave in the clear is the communication graph: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are often capability-labeled, workflows are structured and chained, and interactions are coupled to real actions, so an observer recovers more than past relationships. It can infer the pending workflow, the task being assembled and the action likely to follow. At machine speed, it can act on that inference before the workflow completes. The threat is therefore one of workflow integrity, not privacy alone: predictive leverage over autonomous action. We give a threat model for the agent communication graph; identify what makes agent metadata distinctively revealing (semanticity, prospectivity, actuation); define transport- and bootstrap-layer privacy properties and weigh candidate transports (SimpleX/SMP, Tor, mixnets) against them; and present an A2A case study in which a metadata-protecting binding is expressible but surfaces the protocol's identity assumptions. We test these on a generative model anchored to a real A2A capture. From passive metadata alone, with no payloads, a classifier recovers a task's class well above chance, from only the workflow's opening; applied together, the properties drive that recovery sharply back toward chance. Beyond what an observer can recover, we measure the leverage of acting on the leak: from a workflow's opening and under a fixed budget, an adversary choosing which workflows to act on realizes in this model most of a clairvoyant attacker's advantage over a metadata-blind one, and the same properties suppress it.
comment: 12 pages, 6 figures
The Three-Ring Architecture: Governing Agents in the Era of On-Platform Organisations
The current phase of enterprise AI deployment faces a structural failure: organisations are acquiring agentic capability without the infrastructure to govern it. The result is expected to reproduce the error of the first wave of AI deployment: decentralised intelligence without a federation layer leading to a 95% project failure rate. This paper formalises the Three-Ring Architecture as the governing infrastructure of the on-platform organisation. Ring 1 is the existing production architecture; Ring 2 is the M2 federation layer built on strategies-based agentic AI; Ring 3 is the LLM-based frontier intelligence layer. Ring 2 constitutes, in the technically exact sense, the operating system of the agentic enterprise - performing at the organisational level what a computing OS performs at the device level: resource abstraction, process coordination, permission enforcement, and a stable platform for compounding intelligence. A central contribution is the formal distinction between Ring 2 and Ring 3 risk profiles. Strategies-based agents operate within a deterministic framework: their consequences are traceable, their permissions enforceable, their deviations recoverable. LLM-based agents introduce a categorically distinct risk: a non-deterministic actor whose deviations propagate through complex organisational systems without retrospective traceability. Ring 2 is not a useful addition - it is a necessary condition of control and compliance. A further consequence: every improvement in LLM capability is a structural tailwind for this architecture. More capable non-deterministic actors produce larger consequences when they deviate. The governance requirement scales with capability. The architecture has been validated across a decade of deployment in financial services, government, procurement, and compliance among other sectors.
comment: 28 pages
Modeling U.S. Attitudes Toward China via an Event-Steered Multi-Agent Simulator
Understanding the dynamic evolution of opinions, such as U.S. public attitudes toward China, is essential for assessing geopolitical risks. However, existing LLM-based multiagent simulators predominantly rely on static rules and fixed datasets, limiting their ability to capture the dynamic, event-driven nature of macro-level opinion shifts in real-world settings. To address this limitation, we propose an Event-Steered Multi-Agent Simulator (ES-MAS), in which significant events and daily news continuously drive opinion evolution through dynamic interactions among agents. We first construct the China-U.S. Relation Evolution (CURE) dataset, covering 20 quarters from 2021 to 2025, including 258 major events and over 14,000 daily news articles, and providing a comprehensive temporal foundation for modeling opinion dynamics. Building upon the CURE dataset, we propose a Dual-Stream Data Integration Engine (DSDIE) that aligns simulations with historical timelines via macro-level events while enabling personalized information exposure based on individual agent profiles and contextual signals. Furthermore, we design a News-Driven Dynamic Interaction (NDDI) module, which adaptively groups agents with shared news interests into localized interaction contexts, facilitating bottom-up consensus formation while mitigating the risk of isolated information cocoons. Experimental results on the CURE dataset demonstrate that ES-MAS substantially outperforms existing simulators in reproducing real-world historical trends, offering a scalable and effective framework for modeling dynamic opinion evolution.
Overcoming the Regulatory Bottleneck via Agent-to-Agent Protocols: A Nuclear Case Study
Regulatory review of advanced nuclear reactor designs routinely spans more than three years and consumes hundreds of millions of dollars in combined regulator and applicant labor. We present the Regulatory Context Protocol (RCP), an Agent-to-Agent communication standard that replaces the formal human-to-human pipeline between regulators and applicants with a structured, auditable agentic channel, while preserving human oversight at safety-significant decision points. The protocol is calibrated against an analysis of 1,236 documents from U.S. Nuclear Regulatory Commission advanced reactor dockets and demonstrated with a working multi-agent pilot. Against an 89M USD, 42-month Reconstructed Baseline, RCP cuts costs by 50-77 percent (21M-44M USD) and timelines by 65 percent (15 months). Without a shared protocol, Standalone Agents reach only 54M-74M USD and 21 months. The residual cost-and-time gap is structural, not algorithmic: it traces to the inter-organizational pipeline that only an agent-to-agent standard can compress. The same bottleneck - formal multi-party review under strict auditability requirements - characterizes pharmaceutical approvals, environmental permitting, financial supervision, and aviation certification. The US regulatory paperwork burden carries a 426.5 billion USD annual opportunity cost; replicated broadly, the projected 50-77 percent reduction implies savings on the order of 210-330 billion USD per year - approaching 1 percent of US GDP.
comment: 26 pages, 10 figures
Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimension Method
LLM-agent workflows chain model calls and tool invocations, and spend most of their wall-clock time waiting on upstream operations before downstream ones can start. Speculative execution can reclaim that idle time by launching a downstream operation with a predicted upstream input, but here each speculation costs real money (per-token billing) and its success probability is hard to estimate and drifts over time. This paper presents a method organized around five design decisions: (D1) start a downstream operation before its upstream completes; (D2) price each speculation in real dollars at separate input and output rates; (D3) expose a single operator dial for latency versus cost; (D4) decide via an expected-value rule with a failure-weighted cost term and a preference-adjusted threshold; and (D5) estimate the success probability with a Bayesian Beta-Binomial posterior whose prior is keyed to a dependency-type taxonomy. Variants of these ideas appear in recent work; the combination, with every decision logged in dollars, is what is new. The rule fires only on edges passing an admissibility precondition (side-effect-free, idempotent, or stageable behind a commit barrier), since a wrong speculation is rolled back by re-execution, which refunds tokens but cannot un-send an irreversible side effect. We specify the runtime mechanics, a closed-form result that the rule self-limits as the upstream branching factor grows, a five-stage calibration pipeline (offline replay, shadow, canary, online calibration, drift-triggered kill-switch), and a workload-fit rubric over eight production archetypes. Contrast tables against the four closest published systems (DSP, Speculative Actions v2, Sherlock, B-PASTE) show differentiators on every dimension, and a synthetic validation suite confirms the predicted decision boundary, probability threshold, posterior recovery, and streaming-cancellation behavior.
GRPO Does Not Close the Multi-Agent Coordination Gap
We measure how well current large language models coordinate as multiple agents sharing a common resource, using the dining philosophers problem as a clean test bed. Across 630 episodes spanning seven models and three philosopher counts, four frontier closed-source systems reach mean reward 0.45 to 0.87 and Mistral-Small 24B reaches 0.83 to 0.99, while Qwen3-14B reaches 0.13 to 0.35. We then ask whether group relative policy optimization (GRPO) on rollouts from the task itself can close the gap and find that it cannot: a Welch's t-test on per-episode reward at five philosophers gives p = 0.66 and a Hedges' g of -0.11, with no statistically significant change at ten or fifteen philosophers either. Two further observations qualify the result. The training reward of both 8B and 14B runs peaked at step nine and then declined, so the default saved checkpoint at step 15 is strictly worse than several earlier ones. The four-term reward we use admits a degenerate maximum at zero actions, which DeepSeek-R1-Distill-Qwen-7B and Mistral-Small 24B at five philosophers both inhabit, with mean reward 1.0 and 0.83 respectively at zero meals. The bottleneck for an open-weight 14B model on multi-agent coordination is not training compute but training methodology: reward shaping that does not collapse to a no-action maximum, checkpoint discipline that does not depend on the final step, and curriculum across problem scales.
comment: 15 pages, 15 figures
Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence
LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, returning SUPPORTS/REFUTES is an unauthorized directional commitment, a failure we name Cherry-pick Override (CCO). We define CCO under an explicit task contract and report it with a same-denominator diagnostic protocol paired with matched-coverage bootstrap and an apples-to-apples random-veto null. On AVeriTeC's Conflicting subset (N_C = 150), three-option judges return a directional verdict on more than 84% of mixed-evidence claims; under the typed schema, three-judge majority voting amplifies direction-on-conflict on AVeriTeC (0.887 vs. 0.840; 95% CI [+0.013, +0.080]) but does not replicate on VitaminC-Mixed. Walking an intervention ladder of common single-channel fixes (typed vocabulary, panel aggregation, confidence thresholding, validator-only filtering), each leaves a distinct residual failure: panel aggregation suppresses single-judge CONFLICTING dissent in 48% of CCO cases; the panel is well-calibrated for direction (ECE = 0.07 on pure-S/R) so confidence cannot operationally separate CCO from correct directional commits; validator-as-classifier nearly halves pure-evidence accuracy. A minimal two-channel reference probe reaches operating points neither single channel reaches; under the random-veto null its promotion to CONFLICTING is structurally targeted on AVeriTeC (empirical p < 1/2001) and weaker but in the same direction on VitaminC-Mixed, a selectivity result rather than a magnitude one. We argue for an external commitment-control layer that separates verdict generation from commitment authorization, using structural evidence and confidence as orthogonal channels and NO-COMMIT as a routed controller state.
comment: 12 pages, 1 figure
Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems
The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to ''Machiavellian'' behaviors where agents strategically violate safety rules to maximize rewards - a direct manifestation of Goodhart's Law. To address this blind spot, we introduce MAC-Bench, a dynamic, adversarial benchmark designed to evaluate the procedural alignment of multi-agent systems under realistic pressure. We propose the SERV(Seed - Evolve - Refine - Verify) pipeline, an ``Agent-as-a-Benchmark'' paradigm that transforms unstructured legal texts into executable, contamination-free scenarios. By synthesizing holographic sandbox environments and injecting calibrated social-engineering pressure vectors, MAC-Bench forces agents into Pareto-optimal trade-offs between task success and regulatory adherence. We introduced novel metrics: the Compliance-Weighted Success Rate (CSR) and the Machiavellian Gap (MG), and conducted a comprehensive evaluation of state-of-the-art frontier models to reveal the pervasive trade-offs between success and compliance.
TRAPS: Therapeutic Response Analysis via Pathway-informed Stratification
Cancer treatment planning requires decisions across multiple clinical dimensions at once. Clinicians must determine whether a patient should receive targeted molecular therapy, radiation therapy, and whether they are likely to survive beyond six months. Existing pathway-informed deep learning models have been developed and tested in isolation, making fair comparison across architectures impossible. We present the first unified benchmark for pathway-guided therapy response modeling, evaluating three biologically informed architectures, BINN, GraphPath, and PATH, across five cancer cohorts drawn from The Cancer Genome Atlas, representing 2,622 patients encoded using Reactome pathway activity scores. Each model is trained jointly on all three clinical outcomes under identical data and evaluation conditions, the first study to treat pathway-structured deep learning as a combined therapy and survival prediction problem. Our results show that no single architecture wins across all tasks: PATH performs best for targeted molecular therapy prediction overall, BINN is most reliable for survival prediction, and no model produces useful predictions for radiation therapy, as the key drivers of that decision are clinical variables not captured in gene expression data. Most strikingly, GraphPath achieves an AUROC of 0.92 on prostate targeted molecular therapy prediction, the highest score in the entire benchmark, demonstrating that lateral co-regulation structure produces exceptional discriminative power when matched to a cohort with a narrow targetable driver programme, even under conditions of extreme class imbalance at only 11\% positive prevalence.
Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning
Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.
RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs
Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification, detection, and patching. However, their potential in automated vulnerability report documentation and analysis remains underexplored. We present RAVEN (Retrieval Augmented Vulnerability Exploration Network), a framework leveraging LLM agents and Retrieval Augmented Generation (RAG) to synthesize comprehensive vulnerability analysis reports. Given vulnerable source code, RAVEN generates reports following the Google Project Zero Root Cause Analysis template. The framework uses four modules: an Explorer agent for vulnerability identification, a RAG engine retrieving relevant knowledge from curated databases including Google Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured report generation. To ensure quality, RAVEN includes a task specific LLM Judge evaluating reports across structural integrity, ground truth alignment, code reasoning quality, and remediation quality. We evaluate RAVEN on 105 vulnerable code samples covering 15 CWE types from the NIST-SARD dataset. Results show an average quality score of 54.21%, supporting the effectiveness of our approach for automated vulnerability documentation.
SW-$A^2$-Bench: Benchmarking Autonomous Software Agent Generation for Agentic Web
The Agentic Web is emerging as a paradigm in which autonomous software agents interact with online resources and with each other to accomplish user goals. However, the capacity of Agentic Web is still limited by insufficient autonomous software agent population, which has become a crucial challenge for scaling Agentic Web. In order to alleviate this, we study the task of automatically converting existing code repositories into autonomous software agents via coding agents, decompose the process into critical stages, and identify key technical hurdles. To systematically evaluate this capability, we propose SoftWare Agent generation for Agentic Web Bench (SW-$A^2$-Bench), the first benchmark designed for software agent generation. SW-$A^2$-Bench evaluates not only whether software agents can be generated, but also whether generated software agents are faithful to the source repositories and interoperable with other agents in multi-agent workflows. Our experiments demonstrate that our approach effectively activates the functional capabilities of code repositories and enables interoperable multi-agent collaboration in Agentic Web. We believe that this work will provide a standardized evaluation for software agent generation and will contribute to the future of scaling the capacity of Agentic Web.
Using Feasible Action-Space Reduction by Groups to fill Causal Responsibility Gaps in Spatial Interactions AAMAS 2026
Heralding the advent of autonomous vehicles and mobile robots that interact with humans, responsibility in spatial interaction is burgeoning as a research topic. Even though metrics of responsibility tailored to spatial interactions have been proposed, they are mostly focused on the responsibility of individual agents. Metrics of causal responsibility focusing on individuals fail in cases of causal overdeterminism - when many actors simultaneously cause an outcome. To fill the gaps in causal responsibility left by individual-focused metrics, we formulate a metric for the causal responsibility of groups. To identify assertive agents that are causally responsible for the trajectory of an affected agent, we further formalise the types of assertive influences and propose a tiering algorithm for systematically identifying assertive agents. Finally, we use scenario-based simulations to illustrate the benefits of considering groups and how the emergence of group effects vary with interaction dynamics and the proximity of agents.
comment: Presented at COINE workshop collocated with AAMAS 2026
Feasible Action Space Reduction for Quantifying Causal Responsibility in Continuous Spatial Interactions
Understanding the causal influence of one agent on another agent is crucial for safely deploying artificially intelligent systems such as automated vehicles and mobile robots into human-inhabited environments. Existing models of causal responsibility deal with simplified abstractions of scenarios with discrete actions, thus, limiting real-world use when understanding responsibility in spatial interactions. Based on the assumption that spatially interacting agents are embedded in a scene and must follow an action at each instant, Feasible Action-Space Reduction (FeAR) was proposed as a metric for causal responsibility in a grid-world setting with discrete actions.Since real-world interactions involve continuous action spaces, this paper proposes a formulation of the FeAR metric for measuring causal responsibility in space-continuous interactions. We illustrate the utility of the metric in prototypical space-sharing conflicts, and showcase its applications for analysing backward-looking responsibility and in estimating forward-looking responsibility to guide agent decision making. Our results highlight the potential of the FeAR metric for designing and engineering artificial agents, as well as for assessing the responsibility of agents around humans.
comment: In review
ReclAIm: A Multi-Agent Framework for Monitoring and Correcting Performance Decline in Medical Imaging AI
Purpose: To develop and evaluate a multi-agent framework (ReclAIm) for automated monitoring, detection, and correction of performance decline in medical image classification models. Materials and Methods: ReclAIm is a large language model-based multi-agent system that operates through natural language interaction. A master agent coordinating three task-specific agents performed performance evaluation and triggered fine-tuning when substantial performance declines were detected. The fine-tuning workflow incorporated data augmentation, class imbalance handling, and a parameter-anchoring regularization strategy to limit catastrophic forgetting. The system was benchmarked using multiple imaging datasets, including brain MRI, chest CT, and chest radiography, partitioned into model development, inference (performance monitoring), and fine-tuning subsets (60%:20%:20%). Results: ReclAIm successfully orchestrated training, evaluation, and performance monitoring across all datasets. Performance discrepancies between test and inference data were detected in 8 of 18 models, prompting fine-tuning workflows that reduced performance gaps. In cases with declines of up to 40.6% (cardiomegaly dataset, InceptionV3), fine-tuning restored performance metrics to within 2% of baseline values. Conclusion: ReclAIm provides a prototype framework for automated monitoring and targeted fine-tuning of medical image classification models, with a natural language interface designed to support accessibility in research and potential clinical applications.
comment: Published in Radiology: Artificial Intelligence (https://doi.org/10.1148/ryai.250923)
OpenAgenet / OAN Yellow Paper: Technical Architecture for Trust-Governed Resource Identity and Discovery
This yellow paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection and discoverable AI resource products. It specifies the role architecture, \texttt{did:oan} identity objects, registration workflow, governance-backed Root lifecycle enforcement, Root-verified package model, authorization-aware Discovery, Root-issued infrastructure authorization VCs, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations. The design is intended to support heterogeneous Agent frameworks and interaction protocols, including MCP, A2A, ANP-like systems, domain-specific Agent protocols, Skills, MCP Servers, and Tool/API resources. OAN does not define the entire business conversation among Agents or the native protocol of every resource; it defines how resource identities become admissible, discoverable, verifiable, and safe to approach before protocol-specific interaction begins.
OpenAgenet / OAN White Paper: Open Infrastructure for Trusted Agent Interconnection
OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes visible when Agents move from isolated applications into open, multi-operator networks: before an Agent can safely discover, select, and invoke another Agent, it needs a way to verify identity provenance, governance state, discovery authorization, freshness, and pre-connection trust evidence. OAN is designed as a protocol-neutral trust layer. It does not replace Agent interaction protocols, tool protocols, model orchestration frameworks, or application-level workflows. Instead, it provides \texttt{did:oan}-based resource identity, governance-backed admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, Root-issued infrastructure authorization VCs, and signed trusted invocation. The architectural center of OAN is the combination of federated governance, resource identity, and trusted Discovery, rather than a single directory or naming service. This white paper explains the motivation, architecture, roles, governance model, relationship with MCP, A2A, and ANP, deployment patterns, cooperation model, on-chain governance layer, prototype status, performance profile, and roadmap of OAN.
Noncooperative Coordination via a Trading-based Auction
Noncooperative multi-agent systems often face coordination challenges due to conflicting preferences among agents. In particular, when agents act in their own self-interest, they may prefer different choices among multiple feasible outcomes, leading to suboptimal outcomes or even safety concerns. We propose an algorithm named trading auction for consensus (TACo), a decentralized approach that enables noncooperative agents to reach consensus without communicating directly or disclosing private valuations. TACo facilitates coordination through a structured trading-based auction, where agents iteratively select choices of interest and provably reach an agreement within an a priori bounded number of steps. A series of numerical experiments validate that the termination guarantees of TACo hold in practice, and show that TACo achieves a median performance that minimizes the total cost across all agents, while allocating resources significantly more fairly than baseline approaches.
CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs
Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants while deciding what to reveal about the person it represents. We introduce CalBench, a controlled benchmark for multi-agent calendar scheduling under private information. In each task, $N$ agents manage separate private calendars and schedule a stream of $M$ incoming meetings while minimizing disruption costs. Because no agent can inspect another agent's calendar, success requires language-mediated coordination rather than centralized planning. CalBench generates solvable scenarios with CP-SAT oracle solutions and decentralized non-LLM reference protocols, enabling evaluation of task success, excess cost, communication efficiency, burden fairness, and privacy leakage under matched information constraints. Across seven model families, we find that completion alone misses important failures: agents leave avoidable cost on the table, communication volume does not predict lower regret, and privacy-preserving silence can deprive teammates of cost information needed for fair burden allocation. CalBench provides a reproducible testbed for studying whether autonomous assistants can coordinate on behalf of users before deployment at scale.
PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits ICLR 2026
Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency and effectiveness continue to remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models in a lesser or comparable query budget. Particularly, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Claude's Opus 4.1, two models that are considered highly resistant to jailbreaks in safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.
comment: Accepted in ICLR 2026
Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX
High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.
comment: Fix minor metadata mistakes
A Game-Theoretic Decision Framework for Optimal Selection of Coordination Detection Methods in Multi-UAV Fleet Operations
Detecting coordination among unmanned aerial vehicle (UAV) fleets operating in shared airspace and identifying the route-lead aircraft whose navigation decisions govern fleet behavior presents a fundamental speed--accuracy trade-off: fast methods enable real-time traffic management but sacrifice detection fidelity, while accurate methods may exceed the time budget for actionable airspace deconfliction. This paper presents a game-theoretic decision framework that resolves this trade-off by formulating method selection as a two-player zero-sum game between a Monitor (selecting computational methods and parameters) and Nature (selecting the unknown traffic scenario). We construct an end-to-end pipeline from trajectory surveillance data through eight candidate detection algorithms, a Monte Carlo sensitivity analysis characterizing their stochastic performance, and finally a multi-objective optimization layer that identifies Pareto-optimal method portfolios. The minimax solution provides a robust mixed strategy with a probability distribution over methods that guarantees worst-case performance regardless of scenario uncertainty. Experimental evaluation across 200 randomized configurations spanning 5--50 aircraft demonstrates that the framework recommends distinct method portfolios depending on operational priority: Koopman Phase dominates balanced (70.6%) and speed-priority (79.7%) profiles, while CRQA emerges as primary (47.4%) when route-lead identification is prioritized. The framework achieves a guaranteed game value of 0.29--0.53 (normalized utility) across all tested preference profiles, providing the first principled, scenario-adaptive methodology for computational method selection in UTM fleet monitoring operations.
Systems and Control (EESS)
OPENPATH: A Supervisor--Specialist Agent System for Personalized, Accessible, and Multi-stop Urban Trip Planning
Urban trip-planning systems are commonly optimized for travel time and cost, but they offer limited support for the heterogeneous needs that real travelers bring, such as personalized preferences, multi-stop itinerary construction, and end-to-end wheelchair accessibility. We present openpaths, a supervisor-specialist multi-agent system that handles all of these tasks within a single architecture. openpaths adopts a deliberate division of labor: LLM agents parse natural-language input, classify request intent, and orchestrate execution, while classical algorithms perform route optimization over curated mobility and accessibility data. This design ensures that the resulting trip honors heterogeneous user preferences and enforces strict accessibility requirements when requested. Beyond per-user planning, openpaths doubles as a measurement instrument for city-scale accessibility analysis: applied to NYC, the system reveals substantial ADA infrastructure gaps and quantifies their effect on job accessibility for wheelchair users. Overall, this study shows how a supervisor-specialist LLM agentic framework can support heterogeneous trip planning and transparent, equitable transportation analysis in real urban environments.
Physiologically Constrained Musculoskeletal Neural Network for Multi-DoF Joint Kinematics Estimation from Partially Observed sEMG
This paper investigates multi-degrees of freedom (DoF) joint kinematics estimation under partially observed surface electromyography (sEMG), where only a subset of task-relevant muscles can be measured due to anatomical inaccessibility or sensor constraints. A novel musculoskeletal neural network (MSK-NN) is proposed to estimate multi-DoF joint angles while simultaneously inferring activations for both measured and unmeasured muscles. MSK-NN consists of a CNN-based muscle activation estimator and an embedded MSK forward dynamics module, forming a fully differentiable architecture. Unlike existing hybrid neural frameworks that require additional biomechanical labels (e.g., muscle-tendon forces, joint torques), MSK-NN is trained without direct supervision of internal biomechanical variables. A composite physics-physiology loss is designed by incorporating a joint kinematics loss, a data-driven muscle synergy loss, and an anatomy-guided trend loss. The proposed method is evaluated on two-DoF wrist kinematics estimation across three rhythmic motions with unconstrained speed and amplitude, and one random motion. Compared with CNN, Bi-LSTM, CNN-LSTM, and PET baselines, MSK-NN achieves lower normalized root mean square error (NRMSE) and higher coefficient of determination (R2), especially for the random motion. More importantly, the optimized MSK parameters remain within physiological limits, and the estimated activation of an input-excluded muscle exhibits strong temporal agreement with its recorded sEMG envelope, demonstrating the capability of musculoskeletal (MSK)-NN to recover physiologically plausible activations.
On orbital stabilization of a circular motion primitive for a dynamic extension of the Dubins car model
This paper addresses orbital stabilization of a circular motion primitive for a dynamic extension of the Dubins car model within a transverse-linearization framework. We show that the corresponding transverse linearization is unstable and not stabilizable by linear state feedback. Therefore, the standard linearization-based approach to orbital stabilization cannot be applied directly. The main contribution is a set of explicit and verifiable conditions that characterize when a controller design based on transverse linearization remains applicable. These conditions rely on the specific structure of the dynamics in a neighborhood of the motion and on the use of non-standard transverse coordinates for controller design and analysis. Numerical simulations illustrate the proposed design procedure.
comment: 34 pages
Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability
The ISO 26262 standard defines functional safety for road vehicles through risk assessments based on Severity, Exposure, and Controllability, grounded in a human-driven vehicle paradigm. In the context of autonomous vehicles (AVs), the absence of a human driver necessitates revisiting these principles. This paper decomposes the Controllability placeholder into two auditable evidence dimensions of ISO 26262 by introducing two measurable sub-concepts: Transferability and Predictability. Transferability extends Controllability to capture AV systems' ability to hand off control to dedicated fallback safety mechanisms, while Predictability captures how easily external agents can anticipate AV behavior. Predictability is formally defined from human-robot interaction-inspired principles, and a mathematical framework is provided to quantify it. A designed-versus-achievable gap is introduced to distinguish architectural fallback claims from scene-conditioned achievable fallback capability. The proposed metrics align with ISO 26262 and ISO/PAS 21448 (SOTIF), rendering fallback and interaction claims falsifiable and traceable across ODD slices. These dimensions complement rather than replace existing standards, and the enhancements preserve the structure of ISO 26262 while extending its applicability to driverless automated systems operating at SAE Levels 4 and 5.
An End-to-End Encrypted Control Pipeline for Multi-Agent Coordination via CKKS Homomorphic Encryption
Cloud-based coordination of multi-agent systems requires sharing state with a central server, creating a conflict between coordination and privacy. Fully homomorphic encryption (FHE) resolves this in principle, but its severe arithmetic constraints demand that every stage of the control loop be redesigned from first principles. We present an end-to-end encrypted control pipeline in which sensing, state estimation, state propagation, and consensus control all operate on CKKS-encrypted data using only addition, multiplication, and cyclic rotation. In order to overcome the computational challenges of FHE, we employ steady-state Kalman gains instead of solving for the matrices online and graph Laplacians are applied via the diagonal method at a cost proportional to the number of nonzero cyclic diagonals, accommodating ring, torus, and complete-graph topologies within a unified framework. To quantify the cumulative effect of encryption noise, we use the separation principle to decouple controller and observer error dynamics and derive a periodic bootstrapping bound in which CKKS bootstrapping acts as an impulsive disturbance; the resulting steady-state error ball depends on the bootstrapping precision and the closed-loop spectral radius, providing a direct design equation for the privacy-accuracy tradeoff. The pipeline is validated on a multi-agent formation control scenario, confirming stable closed-loop operation under encryption with bounded tracking error.
comment: 8 pages, 4 figures. This work has been submitted to the IEEE for possible publication
Unlocking feedforward capabilities in Model Predictive Control algorithms to deal with measurable disturbances
Disturbance rejection is a central objective in process control, particularly when measurable disturbances can be exploited through feedforward action. Although Model Predictive Control (MPC) naturally incorporates disturbance models and prediction capabilities, standard formulations cannot achieve complete disturbance rejection since the cost function penalises control effort. This limitation prevents MPC from reproducing the behaviour of classical feedforward compensators. This work proposes a novel framework to embed true feedforward capabilities within MPC without removing the control effort penalty. The approach introduces a dual-control structure in which two control actions are computed simultaneously: a tracking-oriented action addressing set-point tracking and robustness, and a feedforward-oriented action dedicated to disturbance rejection. Both contributions are combined into a single control signal on which the process constraints are explicitly enforced. The feedforward-oriented action is formulated without penalising control effort, enabling full compensation of measurable disturbances. The methodology is developed for Dynamic Matrix Control (DMC), Generalised Predictive Control (GPC), and state-space MPC. Its effectiveness is demonstrated through simulation studies, including comparisons with standard MPC and classical feedforward schemes. A case study based on a reverse osmosis process shows that the proposed approach improves disturbance rejection while preserving constraint handling and overall control performance.
SABLE: GPU-Based Power Flow Accelerator for Sparsity-Aware Batched Learning
Recent studies have developed GPU-based approaches for solving AC power flow and successfully applied them to standalone power flow problems. However, integrating these approaches into modern differentiable learning frameworks while preserving sparsity remains challenging. To this end, we present SABLE, a GPU-based sparse batched power flow accelerator for differentiable learning via an implicit power flow layer. SABLE leverages a block-diagonal embedding that reformulates batched three-dimensional Jacobians as a fixed-pattern two-dimensional sparse template that is shared across PyTorch, CuPy, and cuDSS. This common template enables zero-copy interoperability and memory-efficient sparse reuse across the software stack. On top of this representation, SABLE accelerates repeated power flow computations through reusable sparse templates, custom GPU kernels, a cuDSS-based sparse-direct LU solver, and mixed-precision techniques. Extensive experiments show that SABLE improves standalone power flow solving throughput by up to 253.4$\times$ over pandapower and 5.7$\times$ over ExaPF. In end-to-end training, evaluated on AC optimal power flow learning models based on DC3 and DeepLDE, SABLE expands the feasible training batch range by up to 64$\times$ and improves training throughput by up to 206.7$\times$ over the corresponding baseline.
comment: 10 pages
Branch-Level Energy Localization in Three-Phase Loads: Resolving Indeterminacy in Time-Domain
This paper develops a branch-level energy-localization framework for three-phase loads. The instantaneous terminal power of an admissible lumped equivalent is decomposed uniquely as Joule dissipation plus magnetic and electric stored-energy rates, branch by branch. Three formal results are established: a Branch-Level Localization Theorem (uniqueness given an admissible topology); a Topology-Indeterminacy Theorem (multiple admissible topologies reproduce identical terminal data with distinct localizations); and a Generalized Energetic Duality Theorem that organizes classical electrical dualities (Norton-Thevenin, series--parallel, L vs C, R vs G) as restrictions to Linear Time Invariant (LTI) sinusoidal regimes of a single time-domain principle in which constant-parameter equivalence is replaced by time-varying parameters. The framework is exercised on six test cases including the de Leon--Cohen open-phase paradox, switched-resistive loads, three-wire delta-versus-wye-virtual indeterminacy, fluctuating-phase loads, and a four-wire nonlinear load with hysteretic, linear, and switched branches. The framework is positioned as complementary to IEEE Std. 1459, CPC, instantaneous p-q, and Fryze-Buchholz-Depenbrock: each answers a different question, and the apparent paradoxes vanish once the question is posed precisely.
Geometric Time-Domain Identification of Three-Phase Load Equivalents from Terminal Measurements
This paper presents a geometric time-domain method for identifying three-phase load equivalents from instantaneous voltage and current measurements at the point of common coupling. Measured waveforms are interpreted as trajectories in Euclidean signal spaces, and load-equivalent parameters are recovered from the geometry of those trajectories. The method extends a previously published single-phase geometric identification formulation to three- and four-wire systems and places special emphasis on the three-wire case, where no neutral voltage is measured and the terminal data must satisfy coupled Kirchhoff constraints. The main advance over the earlier analytical formulation is a sampled-data implementation based on local time windows, normalized matrix equations, harmonic-projection derivative and primitive coordinates, explicit geometric identifiability tests, passivity constraints, and energy/Kirchhoff residuals. The method does not force a model when the measured trajectory lacks enough information; instead, it reports low-rank or ill-conditioned windows as low-confidence evidence. Numerical simulations with clean data, measurement noise, window-length sweeps, and sensor delay show that the method accurately identifies informative three-phase trajectories and exposes structurally degenerate cases such as pure single-frequency excitation for higher-order three-wire models. For a given admissible topology the identified circuit closes the instantaneous terminal energy balance of the measured load over the analysis window.
Power Grid Topology Control
Power grids are facing major challenges from growing renewable integration and worsening climate impacts. While flexibility on both the demand and generation sides has been widely explored to address these challenges, network-side flexibility, especially in network topology, remains highly underutilized. Advances in communication, power electronics, and circuit breakers have made network topology increasingly controllable. However, leveraging this topological flexibility poses substantial challenges, primarily due to the inherent non-convexity and hybrid dynamics in associated optimization and control problems. This monograph surveys the development of power grid topology control in both early and recent years. It begins by discussing the fundamental topological constraints involved in topology control problems. Subsequently, it introduces steady-state topology control for transmission and distribution networks separately, covering fundamentals, a state-of-the-art review, and representative recent advances. Additionally, the network topology transition problem, which addresses the implementation of optimal topology solutions and has garnered increasing attention in recent years, is further modeled and analyzed. Beyond utilizing the flexibility of steady-state network topology, controlling network topology during transients can also contribute to system stabilization. Traditional approaches, such as intentional controlled islanding for transmission networks, as well as recently developed topology control methods for microgrid stabilization, exemplify this concept. Finally, a summary of this monograph is provided.
Forecast and Model Predictive Control of Distributed Energy Resource Aggregators for Net-Demand Balancing
With the rapid demand for energy, even the incorporation of bulk renewable energy sources is not entirely sufficient to meet demand besides adding supply uncertainty. Distributed Energy Resource Aggregators (DERAs) have the potential to address this uncertainty via aggregation and control of decentralized distributed energy sources, thereby acting like virtual power plants. We present a new approach that combines forecasting and model-predictive control to assign DERAs to follow net-demand patterns, while accounting for the dynamics of the aggregate energy sources and their capacity limits. Each DERA is represented as a flexible ``virtual battery" with constraints on state-of-charge and power limits. The dispatch problem is set up as a long-term model predictive control task that aims to minimize differences from desired charge levels, output ramping, and net-load tracking errors. To keep operations efficient in real time, we implement a rolling-horizon MPC, which updates decisions regularly using the latest marginal-demand forecasts. For forecasting, we present two models: linear regression and long-short term memory (LSTM) neural network. Using high-resolution CAISO net-demand data and five typical DERA types, our simulations demonstrate how well our approach tracks marginal-demand; in particular, we highlight the tradeoffs between forecasting horizon times and MPC update rate as well as the dependence on the choice of the load forecasting model. Our results also indicate a slight edge for LSTM models over linear regression for desired time shifts and horizon choices.
Lane Change Trajectory Planning for Personalized Driving Comfort and Mobility Efficiency
Lane changing entails simultaneous longitudinal and lateral motions that affect driving comfort and mobility efficiency. Because these motions are tightly coupled and subject to substantial inter-vehicle variability, trajectory planning for lane-change maneuvers is characterized by a highly personalized nature. This study proposes a neural network-driven planner that integrates a third-order polynomial trajectory generator with a learning module that infers optimal trajectory parameters across diverse driving conditions. Using a shared backbone with dual heads, one head ensures all-condition operational guarantees, while the other captures driver-specific preferences for comfort or mobility efficiency. A head-gated switching mechanism, realized through a statistical gate based on error-winner logistic regression, adaptively selects the appropriate head under varying driving conditions, which enables context-aware lane-change trajectory planning. Representative cases and Monte Carlo simulations show that the proposed planner achieves personalized comfort and mobility during lane changes, while the baseline ensures feasible trajectories under driving conditions where personalized data are insufficient or inaccessible.
comment: Accepted by the IEEE Intelligent Vehicles Symposium (IEEE IV 2026), Detroit, MI, United States, June 22_25, 2026
Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension
This paper presents ERNEST, a four-wheeled planetary rover concept equipped with a two-degree-of-freedom Active Gimbal Suspension that combines yaw and roll actuation to enable wheel reconfiguration, steering, and active load redistribution. A single neural network controller, trained to track a desired path across challenging terrain, fully unlocks the capabilities of this actuated suspension system for autonomous obstacle negotiation. A reinforcement learning framework is developed using the high-fidelity DARTS simulation engine, which combines rigid-contact dynamics and Bekker-Wong terramechanics, enabling the emergence of locomotion strategies adapted to loose-soil conditions. To obtain a single unified controller across heterogeneous terrains, a policy consolidation strategy merges the experience of terrain-specialized agents into one neural network, eliminating the need for explicit terrain classification and controller switching. The resulting controller operates on a combination of proprioceptive and exteroceptive feedback, including sparse stereo-derived terrain elevation, chassis attitude, joint states, and force-torque measurements. Zero-shot transfer to the physical rover is achieved through domain randomization, sensor noise injection, and model-to-real system identification. Experimental results demonstrate autonomous traversal of rock fields, a bump trap, a wheel-high step, sand ripples, and sandy slopes. On a 20° sandy slope, the learned controller reduces the cost of transport by 37% on dry sand despite the additional actuation, and achieves superior performance on wet sand where the passive suspension becomes completely immobilized.
comment: 21 pages, 26 figures
CellSense: A Sub-6 GHz Cellular ISAC System for Clutter-Robust Passive Sensing
Future wireless networks demand capabilities beyond traditional communication, driving the development of Integrated Sensing and Communication (ISAC) for environmental awareness, localization, and tracking. Ubiquitous cellular deployment allows ISAC to maximize spectral efficiency, lower costs, and expand sensing coverage. However, sub-6 GHz research has heavily favored communication, leaving sensing capabilities largely underexplored. To bridge this gap, we introduce CellSense, a novel sub-6 GHz ISAC architecture natively integrated into the 5G cellular protocol stack for real-world target tracking. We validate the system via Sionna-based orthogonal frequency-division multiplexing (OFDM) link-level simulations and an experimental USRP hardware prototype using the OpenAirInterface (OAI) stack. Furthermore, we analyze the communication-sensing tradeoff by quantifying how pilot symbol density impacts throughput versus sensing accuracy. Simulations show that CellSense achieves a 74 percent detection probability with a 1.43 m localization error in indoor warehouse environment, which improves to 94 percent detection and a sub-meter error of 0.33 m in the outdoor environment of Oval area at the NCSU Centennial campus. Hardware experiments in a highly cluttered indoor laboratory confirm a 1.28 m localization accuracy and 76 percent detection probability, proving its efficacy for practical ISAC deployments.
comment: 6 pages, 13 images and submitted to MILCOM 2026
Adverse Effects of V2V Adoption on Road Safety
Vehicle-to-vehicle (V2V) communication is expected to improve road safety and reduce congestion. However, prior work shows that V2V information sharing under partial adoption may increase congestion and decrease safety. We study whether increasing V2V adoption itself affects road safety. We propose a corrected version of an existing model and analyze its behavior under varying adoption levels. We show that, in some cases, increased V2V adoption can increase accident probability. Moreover, under an optimal signaling policy, the system can ensure that accident probability is non-increasing in the adoption level.
Stability Without Safety: Gain Manipulation Attacks on Agentic Cyber-Physical Systems
Agentic cyber-physical systems (CPS), where autonomous AI agents participate in runtime control decision-making, introduce agent-driven parameter-update pathways absent from conventional feedback architectures. These pathways form a parameter channel structurally distinct from classical sensor and actuator channels. Among these parameters, feedback gains are the highest-leverage target: a single gain matrix determines closed-loop eigenvalue placement for the entire system, and malicious updates can directly alter closed-loop dynamics while evading residual-based monitors. We formalize this attack surface through a three-axis attacker model and a taxonomy of Gain Manipulation Attacks (GMA). Two impact classes are identified: stability-margin erosion under sustained gain drift, and transient amplification under one-shot gain replacement. A stability-preserving gain replacement can still produce transient amplification far exceeding safe operating limits, and stability verification alone is insufficient to bound the physical impact of such attacks. Stealthiness conditions and worst-case impact certificates are derived for each class via Bauer--Fike eigenvalue bounds and the Kreiss matrix theorem, with preliminary detection directions and a vehicle lateral dynamics example provided.
comment: 6 pages, 2 figures, 2 tables
Koopman meets input-output data: Data-driven output-feedback control of nonlinear systems with closed-loop guarantees
Data-driven control of nonlinear systems from input-output measurements remains a fundamental challenge, as existing approaches with rigorous closed-loop guarantees predominantly require access to full state measurements. In this paper, we address this gap by proposing a data-driven output-feedback controller design method for nonlinear systems that provides provable closed-loop guarantees while operating solely on measured input-output data. Our approach combines Koopman operator theory with an extended state representation of the nonlinear system constructed from input-output trajectories. This allows us to obtain a bilinear surrogate model directly from data, on which robust state-feedback design methods can be applied. By exploiting the observability of the underlying nonlinear system, we establish exponential stability of the extended state, which in turn implies exponential convergence of the original system state to the origin. Finally, we validate our theoretical findings in numerical simulations.
A Human-Sensitive Controller: Adapting to Human Musculoskeletal Disorder-Related Constraints via Reinforcement Learning
Work-Related Musculoskeletal Disorders continue to be a major challenge in industrial environments, leading to reduced workforce participation, increased healthcare costs, and long-term disability. This study introduces a human-sensitive robotic system aimed at reintegrating individuals with a history of musculoskeletal disorders into standard job roles, while simultaneously optimizing ergonomic conditions for the broader workforce. This research leverages reinforcement learning (RL) to develop a human-aware control strategy for collaborative robots, focusing on optimizing ergonomic conditions and preventing pain during task execution. Two RL approaches, Q-Learning and Deep Q-Network (DQN), were implemented and tested to personalize control strategies based on individual user characteristics. Although experimental results revealed a simulation-to-real gap, a fine-tuning phase successfully adapted the policies to real-world conditions. DQN outperformed Q-Learning by completing tasks faster while maintaining zero pain risk and safe ergonomic levels, achieving on average 38% shorter task completion times across all tested anthropometries. The structured testing protocol confirmed the system's adaptability to diverse human anthropometries, underscoring the potential of RL-driven cobots to enable safer, more inclusive workplaces.
Structure-preserving Optimal Kron-based Reduction of Radial Distribution Networks
Network reduction simplifies complex electrical networks to address computational challenges of large-scale transmission and distribution grids. Traditional network reduction methods are often based on a predefined set of nodes or lines to remain in the reduced network. This paper builds upon previous work on optimal Kron-based reduction of networks, which was formulated as a mixed-integer linear program, to enhance the framework in three aspects. First, the scalability is improved via a cutting plane restriction, tightened Big M bounds, and a zero-injection node reduction stage. Next, we introduce a radiality-preservation step to identify and recover nodes whose restoration ensures radiality. A linearized voltage magnitude error constraint is incorporated to explicitly bound the difference between full and reduced networks. The model is validated through its application to the 533-bus distribution test system and a 3499-bus utility feeder for a set of representative loading scenarios. In the 533-bus system, an 85% reduction was achieved with a maximum voltage error below 0.0025 per unit, while in the 3499-bus feeder, over 94% reduction was obtained with maximum voltage errors below 0.002 per unit. Additionally, we show that the radialization step accelerates the runtime of optimal voltage control problems when applied to Kron-reduced networks.
Clipped Affine Policy: Low-Complexity Near-Optimal Online Power Control for Energy Harvesting Communications over Fading Channels
This paper studies online power control for battery-limited point-to-point energy harvesting communications over slow block-fading channels. A linear-policy-based approximation is developed for the relative-value function in the Bellman equation of the power control problem. This approximation leads to two fundamental parameterized clipped affine policies: an optimistic policy derived from a certainty-equivalence-type approximation and a robust policy derived from worst-case analysis. For independent and identically distributed energy arrivals and channel states, two families of power control schemes are developed based on the optimistic clipped affine (OCA) and robust clipped affine (RCA) policies, respectively. The proposed adaptive RCA policy based on reinforcement learning (RCA-RL) is further extended to address four scenarios with contextual information: one-step energy lookahead, one-step channel lookahead, one-step joint energy-channel lookahead, and Markov energy arrivals. Extensive simulation results show that the proposed schemes provide a favorable tradeoff between computational complexity and performance. The adaptive RCA policy based on the maximin optimal linear-policy-slope approximation (RCA-OLA-A) and the RCA-RL scheme achieve the best overall performance, while the RCA policy based on the maximin optimal linear policy (RCA-OL) is the best-performing closed-form policy. In particular, RCA-OLA-A, RCA-RL, and the aforementioned RCA-RL extensions achieve less than 2% performance loss relative to the optimal policy across a range of scenarios, consistently outperforming the considered benchmark approaches, including generic reinforcement learning baselines. The RCA-OL policy also performs well with less than 4% performance loss.
comment: 29 pages, 15 figures, v1.0
Efficient Coordination and Synchronization of Multi-Robot Systems Under Recurring Linear Temporal Logic
We consider multi-robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottom-up approach combining offline plan synthesis with online coordination, dynamically adjusting plans via real-time communication. To address action delays, we introduce a synchronization mechanism ensuring coordinated task execution, leading to a multi-agent coordination and synchronization framework that is adaptable to a wide range of multi-robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings through lab experiments involving nine robots showing enhanced adaptability compared to previous methods. Additionally, we conduct simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability features of our work.
Recursive Experiment Design for Closed-Loop Identification of ARMAX Systems with Output Perturbation Limits
In many applications, system identification experiments must be performed in closed loop to ensure safety or to maintain system operation. In this paper, we consider the recursive design of informative experiments for ARMAX models by adding a bounded probing signal to the input generated by a fixed output feedback controller. The resulting output perturbations should be kept within user-specified limits. We analyze the identifiability and feasibility conditions of this setting and then proceed to derive a probing signal that can be efficiently computed in closed form. We demonstrate the effectiveness and properties of the design in numerical experiments.
Off-grid solar energy storage system with hybrid lithium iron phosphate (LFP) and lead-acid batteries in high mountains: a case report of Jiujiu Cabins in Taiwan
Mountain huts are buildings located at high altitude, offering a place for hikers and providing shelter. Energy supply on mountain huts is still an open issue. Using renewable energies could be an appropriate solution. Jiujiu Cabins, a famous mountain hut in Shei-Pa National Park, Taiwan, has operated an off-grid solar energy storage system (ESS) with lead-acid batteries. In 2021, a serious system failure took place, leading to no electricity. After a detailed on-site survey, a reorganization and repair project implemented, the energy system came back to operate normally. Meanwhile, an eco-friendly lithium iron phosphate battery (LFP battery) ESS replaces part of the lead-acid battery ESS, forming a hybrid ESS, making a better and green off-grid solar ESS. In this case report, the energy architecture, detailed descriptions, and historical status of the system are provided. An on-site survey of the failed energy system, a system improvement project, and future plan are listed.
comment: 14 pages, 9 figures, 4 tables
An Exponentially stable Extended Kalman Filter with Estimate dependent Process noise Covariance for Chemical Reaction Networks
Biomolecular systems are often modeled with partially known nonlinear stochastic dynamics, making state and parameter estimation a central challenge. While Kalman filtering techniques are widely used in this setting, their performance critically depends on the choice of the process noise covariance, which is typically assumed constant and heuristically tuned. Such assumptions are not justified for biomolecular systems, where intrinsic noise arises from underlying reaction kinetics. In previous works, a process noise covariance update based on the Chemical Langevin Equation (CLE) was introduced for Extended Kalman Filter (EKF)-based estimation in Chemical Reaction Networks (CRN). In this work, we analyze the stochastic stability of this filtering framework. In particular, we obtain a conservative upper bound on sampling interval for discrete-time biomolecular systems that ensures mean-square exponential boundedness under stated assumptions. The proposed framework is validated through simulations on a nonlinear gene expression model. The analysis provides theoretical justification for CLE-based process noise covariance modeling in EKF design for biomolecular circuits, reducing reliance on heuristic covariance tuning.
Reachability for Low-Thrust Trajectories via Maximum Initial Mass
Reachability analysis plays a central role in low-thrust spacecraft trajectory optimization by identifying which target states can be achieved under constraints on time, thrust, and propellant. Classical approaches construct reachable sets by solving many optimal control problems over grids of terminal states, requiring extensive forward simulations with fixed initial conditions. While effective, this approach is computationally expensive and becomes impractical for high-dimensional systems or strongly nonlinear dynamics, such as those encountered in cislunar environments or solar sail missions. This work introduces a dual formulation of the reachability problem. Instead of computing reachable sets directly, we determine, for fixed transfer time and boundary conditions, the maximum allowable initial mass (or, for solar sails, a scalar sail-strength parameter) that permits a successful transfer. A target is reachable if the spacecraft's initial mass does not exceed this threshold. This reformulation reduces reachability assessment to a scalar optimization problem for each target, producing a smooth scalar field that encodes equivalent feasibility information to classical reachable sets. We develop indirect maximum-initial-mass (MIM) formulations for both electric low-thrust and solar-sail dynamics and show how they can serve as efficient reachability oracles. Building on this formulation, we construct data-driven surrogate models to approximate the MIM-based reachability indicator. We investigate fully connected neural networks and demonstrate that residual networks provide the best trade-off between accuracy, training stability, and model complexity. The resulting surrogates enable rapid reachability evaluation while preserving the numerical advantages of the dual formulation, offering a practical tool for preliminary mission design and feasibility assessment.
comment: Presented at the 30th International Symposium on Space Flight Dynamics, 1-5 June 2026, Toulouse, France
Data-Driven Adaptive Second-Order Sliding Mode Control with Noisy Data
This paper proposes a data-driven approach to designing adaptive suboptimal second-order sliding mode (ASSOSM) controllers for a class of single-input nonlinear systems with partially unknown dynamics, subject to both matched and unmatched disturbances. We first view the system as comprising two coupled dynamics, referred to as the upper and lower dynamics, with the last state serving as a virtual input to the upper dynamics. The proposed control-design methodology then follows a two-stage procedure: (i) designing a virtual state-feedback control law for the upper dynamics and (ii) synthesizing an ASSOSM controller for the full-order system. To this end, we collect noise-corrupted data from the system throughout a finite-time experiment. We then formulate a data-dependent condition, whose feasibility enables the design of a virtual state-feedback control law that renders the closed-loop upper dynamics input-to-state stable with respect to the unmatched disturbance. Building on this virtual state-feedback control law, we subsequently propose a data-driven nonlinear sliding variable, based on which an ASSOSM controller is designed for the full-order system. The state trajectories of the resulting closed-loop system are semiglobally ultimately bounded (S-GUB), with the ultimate bound explicitly depending on the magnitude of the unmatched disturbance. In particular, the control design parameters can be selected for any prescribed bounded set of initial conditions so that the state trajectories of the closed-loop system are S-GUB. Moreover, the effect of the matched disturbance is totally rejected after a finite time. The effectiveness of the proposed method is satisfactorily demonstrated in the simulation.
ORIX: Orchestration of RIS with xApps for Smart Wireless Factory Environments
The vision of a smart wireless factory (SWF) demands highly flexible, low-latency, and reliable connectivity that goes beyond conventional wireless solutions. Reconfigurable intelligent surface (RIS)-empowered communications, when integrated with the open radio access network (O-RAN) architectures, have emerged as a promising enabler to meet these challenging requirements. This article introduces the methodology for the orchestration of RIS with xApps (ORIX), bringing the RIS technology into the O-RAN ecosystem through xApp-based control for SWF environments. ORIX features three key components: an O-RAN-compliant RIS service model for dynamic configuration, an RIS channel simulator that supports 3GPP indoor factory models with multiple industrial scenarios, and practical RIS optimization strategies with finite-resolution control. Together, these elements provide a realistic end-to-end emulation platform for evaluating RIS placement, control, and performance in SWF environments prior to deployment. The presented case study demonstrates how ORIX enables the evaluation of achievable performance gains, exploration of trade-offs among key RIS design parameters, and identification of deployment strategies that balance system performance with practical implementation constraints. By bridging theoretical advances with industrial feasibility, ORIX lays the groundwork for RIS-assisted O-RAN networks to power next-generation wireless communication in industrial scenarios.
comment: Accepted in IEEE Wireless Communications, Copyright IEEE
Error-State LQR Formulation for Quadrotor UAV Trajectory Tracking
This article presents an error-state Linear Quadratic Regulator (LQR) formulation for robust trajectory tracking in quadrotor Unmanned Aerial Vehicles (UAVs). The proposed approach leverages error-state dynamics and employs exponential coordinates to represent orientation errors, enabling a linearized system representation for real-time control. The control strategy integrates an LQR-based full-state feedback controller for trajectory tracking, combined with a cascaded bodyrate controller to handle actuator dynamics. Detailed derivations of the error-state dynamics, the linearization process, and the controller design are provided, highlighting the applicability of the method for precise and stable quadrotor control in dynamic environments.
Agentic AI-Enhanced Semantic Communications: Foundations, Architecture, and Applications
Semantic communications (SemCom), as one of the key technologies for 6G, is shifting networks from bit transmission to semantic information exchange. On this basis, introducing agentic artificial intelligence (AI) with perception, memory, reasoning, and action capabilities provides a practicable path to intelligent communications. This paper provides a systematic exposition of how agentic AI empowers SemCom from the perspectives of research foundations, system architecture, and application scenarios. We first provide a comprehensive review of existing studies by agent types, covering embedded agents, large language model (LLM)/large vision model (LVM) agents, and reinforcement learning (RL) agents. Additionally, we propose a unified agentic AI-enhanced SemCom framework covering the application layer, the semantic layer, and the cloud-edge collaboration layer, forming a closed loop from intent to encoding to transmission to decoding to action to evaluation. We also present several typical scenarios, including multi-vehicle collaborative perception, multi-robot cooperative rescue, and agentic operations for intellicise (intelligent and concise) networks. Furthermore, we introduce an agentic knowledge base (KB)-based joint source-channel coding case study, AKB-JSCC, where the source KB and channel KB are built by LLM/LVM agents and RL agents, respectively. Experimental results show that AKB-JSCC achieves higher information reconstruction quality under different channel conditions. Finally, we discuss future evolution and research directions, providing a reference for portable, verifiable, and controllable research and deployment of agentic SemCom.
comment: This version is being withdrawn because we need to further reevaluate the attribution of contributions among the authors
Toward Trustworthy Digital Twins in AI Agent-based Wireless Network Optimization: Challenges, Solutions, and Opportunities
Optimizing modern wireless networks is exceptionally challenging due to their high dynamism and complexity. While the AI agent powered by reinforcement learning (RL) offers a promising solution, its practical application is limited by prohibitive exploration costs and potential risks in the real world. The emerging digital twin (DT) technology provides a safe and controlled virtual environment for agent training, but its effectiveness critically depends on the DT's reliability. Policies trained in an unreliable DT that does not accurately represent the physical network may experience severe performance degradation upon real-world deployment. In this article, we introduce a new DT evaluation framework to ensure trustworthy DTs in AI agent-based network optimization. This framework shifts from model-level accuracy, such as wireless channel and user trajectory similarities, to a more holistic, task-centric DT assessment, which relies on the Markov decision process that the agent actually perceives. We demonstrate it as an effective guideline for design, selection, and lifecycle management of wireless network DTs. A comprehensive case study on a real-world wireless network testbed shows how this evaluation framework is used to pre-filter candidate DTs, leading to a significant reduction in training and testing costs without sacrificing deployment performance. Finally, potential research opportunities are discussed.
Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX
High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.
comment: Fix minor metadata mistakes
Pitot-Aided Attitude and Air Velocity Estimation with Almost Global Asymptotic Stability Guarantees
This paper investigates the problem of attitude and air velocity estimation for fixed-wing unmanned aerial vehicles (UAVs) using IMU measurements and at least one Pitot tube measurement, with almost global asymptotic stability (AGAS) guarantees. A cascade observer architecture is developed, in which a Riccati/Kalman-type filter estimates the body-fixed frame air velocity and the vehicle's tilt using IMU data as inputs and Pitot measurements as outputs. Under mild excitation conditions, the resulting air velocity and tilt estimation error dynamics are shown to be uniformly observable. The estimated tilt is then combined with magnetometer measurements in a nonlinear observer on SO(3) to recover the full attitude. Rigorous analysis establishes AGAS of the overall cascade structure under the uniform observability (UO) condition. The effectiveness of the proposed approach is demonstrated through validation on real flight data.
comment: 8 pages, 8 figures. Accepted at IEEE Conference on Control Technology and Applications (CCTA) 2026
A Game-Theoretic Decision Framework for Optimal Selection of Coordination Detection Methods in Multi-UAV Fleet Operations
Detecting coordination among unmanned aerial vehicle (UAV) fleets operating in shared airspace and identifying the route-lead aircraft whose navigation decisions govern fleet behavior presents a fundamental speed--accuracy trade-off: fast methods enable real-time traffic management but sacrifice detection fidelity, while accurate methods may exceed the time budget for actionable airspace deconfliction. This paper presents a game-theoretic decision framework that resolves this trade-off by formulating method selection as a two-player zero-sum game between a Monitor (selecting computational methods and parameters) and Nature (selecting the unknown traffic scenario). We construct an end-to-end pipeline from trajectory surveillance data through eight candidate detection algorithms, a Monte Carlo sensitivity analysis characterizing their stochastic performance, and finally a multi-objective optimization layer that identifies Pareto-optimal method portfolios. The minimax solution provides a robust mixed strategy with a probability distribution over methods that guarantees worst-case performance regardless of scenario uncertainty. Experimental evaluation across 200 randomized configurations spanning 5--50 aircraft demonstrates that the framework recommends distinct method portfolios depending on operational priority: Koopman Phase dominates balanced (70.6%) and speed-priority (79.7%) profiles, while CRQA emerges as primary (47.4%) when route-lead identification is prioritized. The framework achieves a guaranteed game value of 0.29--0.53 (normalized utility) across all tested preference profiles, providing the first principled, scenario-adaptive methodology for computational method selection in UTM fleet monitoring operations.
Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning
Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a principled mechanism for enforcing forward invariance through minimally invasive safety filters, but their use in model-free RL is limited by the need for accurate dynamics and hand-designed barrier certificates. We propose Robust Koopman-CBF SAC, a safety-filtered actor--critic framework that learns a finite-dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them through a quadratic-program safety layer. To account for finite-dimensional Koopman approximation error, the CBF condition is tightened using a projected residual margin estimated from held-out rollout data. The critic is trained on the executed safe action, while the actor is regularized toward the Koopman-CBF feasible set, reducing dependence on the filter over training. Across safe-control benchmarks, the method achieves zero constraint violations on CartPole stabilization and tracking while matching or exceeding unconstrained SAC returns. On high-dimensional Safety Gymnasium locomotion tasks, the method reduces violations in some settings but also exposes important limitations of first-order velocity barriers and linear EDMD models, motivating high-order and multi-step Koopman-CBF extensions. These results suggest that robust Koopman-CBF filters are a promising bridge between model-free RL and certifiable safety, while clarifying the structural conditions under which such filters remain effective.
comment: 17 pages, 7 figures
Robotics
HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers
For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.
comment: 22 pages, 9 figures
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components. (1) A data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times demonstration to any target speed by merging or splitting actions while preserving its motion semantics. (2) A model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
Flow-based Policy Adaptation without Policy Updates
Leveraging prior knowledge from pretrained policies, foundation models, or human operators offers an efficient alternative to learning robot skills from scratch. However, these agents often provide actions that are suboptimal, noisy, or misaligned with task-specific expert behavior. We propose GLOVES, a family of flow-based adaptation methods that correct non-expert actions by transporting them toward an expert action distribution. Rather than replacing agentic control with full autonomy, GLOVES performs selective action-level adaptation, improving task success while preserving agent intent. The learned flow also provides a natural in-distribution scoring mechanism through reverse flow evaluation. We use this signal as an intervention gate: actions that appear consistent with the expert distribution are passed through unchanged, while anomalous or out-of-distribution (OOD) actions are corrected. In this way, assistance is only provided when necessary. GLOVES requires only limited expert supervision, using a small number of demonstrations or reusable successful skill segments. By learning local expert action patterns and stitching them during execution, GLOVES provides a lightweight shared-control module for robust action adaptation across tasks and environments. Code and demos are available at ripl.github.io/GLOVES_web.
RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation
Safety-critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their iterative denoising process is computationally expensive and may accumulate sampling and guidance errors over long rollouts, causing unrealistic motion artifacts such as jitter, abnormal acceleration, and off-road behavior. To address these issues, we propose RiskFlow, a closed-loop safety-critical multi-agent traffic generation framework that formulates future trajectory generation as transport in the action space. Instead of relying on iterative denoising, RiskFlow learns an average velocity field over a finite interval to transform Gaussian action sequences into future acceleration and yaw-rate commands with a single forward pass, using a JVP-based objective for efficient and stable training. At test time, RiskFlow applies output-space guidance to the generated actions, steering selected critical agents toward risky interactions while regularizing off-road behavior, and reconstructs physically feasible trajectories through vehicle dynamics. Experiments on nuScenes with tbsim closed-loop evaluation show that RiskFlow achieves a strong adversariality-realism trade-off across multi-agent and long-horizon settings. Compared with representative baselines, RiskFlow consistently improves realism while maintaining competitive safety-critical generation capability, and substantially reduces inference time for evaluation.
Ensuring Interaction Safety in Multitask Exoskeleton Control: A Simulation-Trained Variable Impedance Framework
Wearable exoskeletons can augment human phys ical capabilities during complex activities. However, ensuring adaptation across diverse tasks while guaranteeing interaction safety remains a critical challenge. To address this, a simulation trained variable impedance control approach with stability guarantees is proposed. First, a simulation-based human exoskeleton motion data generation pipeline is established, utilizing Proximal Policy Optimization (PPO) to synthesize human muscle activations while the exoskeleton provides direct compensation for human biological joint torques. Subsequently, the generated dataset is used to train a dual modality policy that fuses semantic instructions with proprioceptive history, enabling the prediction of reference trajectories and variable impedance gains for nine different motion tasks. To guarantee safety, the network outputs are constrained by a stability criterion derived from Lyapunov stability theory, which bounds stiffness variations to ensure the asymptotic stability of the coupled system. Experimental results indicate that the proposed framework reduces metabolic cost in real-world scenarios com pared with standard baseline methods. These findings suggest the feasibility of the proposed framework for safe, multitask exoskeleton control.
Waypoints Matter: A Systematic Study for Sampling-Based Trajectory Planning SC 2026
Real-time autonomous driving commonly relies on sampling-based trajectory planners that link candidate trajectories to target waypoints along the road centerline. The placement of these waypoints directly impacts both the existence and quality of feasible trajectories. Yet, its effect on planner performance remains largely unexplored. In this paper, we treat waypoint placement as a first-class design variable. We hold the trajectory primitive and candidate budget fixed, and systematically sweep three placement strategies (uniform spacing, an augmented Ramer-Douglas-Peucker variant (RDP*), and a novel curvature-conditioned allocation) across 449 configurations and five CommonRoad maps of increasing geometric complexity. Our results show that the nominal inter-waypoint spacing $d_s$ is the primary performance driver, with large differences in planner reliability attributed to placement alone. Uniform sampling at a well-tuned spacing matches or surpasses both RDP* and the centered curvature variant. The curvature variant offers a small but consistent advantage on geometrically complex roads under reliability-first and balanced weightings, while RDP* never outperforms uniform sampling. These findings suggest that $d_s$ should be treated as the dominant tuning parameter, with geometry-aware strategies reserved for curvature-rich corridors where feasibility is the limiting factor.
comment: 8 pages, 5 figures, 3 tables; accepted at IEEE ITSC 2026
VOLT: Vision and Language Trajectory Segmentation for Faster-than-Demonstration Policies
Humans often take longer to demonstrate a task than a robot would need to execute it. Rather than learning to replicate the demonstration at the same pace, many industrial and practical applications require robots to perform tasks as quickly as possible. In this paper, we investigate several hypotheses for learning policies that operate faster-than-demonstrations. Our experiments show that the most effective strategy is to downsample recorded demonstrations and train the robot's policy on this accelerated data. However, uniformly downsampling an entire trajectory can be problematic. Some parts of a task can be safely sped up (e.g., unconstrained motion), while others demand slower, more precise motion (e.g., object interactions or fine manipulation). To address this challenge, we introduce VOLT, a vision-and-language trajectory segmentation method that reasons over video demonstrations, and leverages contextual cues to determine when acceleration is appropriate and when careful precision is required. VOLT identifies segments where slow, deliberate motion is necessary, then selectively downsamples the remaining segments. The resulting reformatted trajectories can be used with standard imitation learning approaches, such as diffusion policies. Our results highlight that segmentation quality is critical -- baseline methods often misidentify when acceleration is possible, leading to overly cautious or unreliable policies. Compared to state-of-the-art alternatives, VOLT allows robots to execute tasks faster while maintaining strong performance.
Meridian: Metric-Semantic Primitive Matching for Cross-View Geo-Localization Beyond Urban Environments
Successful robot automation requires accurate global localization to support repeatability, task planning, goal specification, and safe operation. However, reliable localization in GNSS-denied environments remains an open problem. Overhead aerial imagery offers a promising solution, but existing approaches primarily target structured urban environments and have been rarely demonstrated in unstructured natural terrain. Limitations of the state-of-the-art include a reliance on models trained for specific environments, as well as difficulty handling repetitive geometries and featureless landscapes commonly found in natural outdoor areas. To overcome these challenges, we present Meridian, a method for matching high-level metric-semantic primitives across aerial images and ground robot RGB-D camera data that achieves accurate global localization and generalizes well across diverse environments, all without any training or algorithmic fine-tuning on area-specific data. We formulate novel consistency metrics to estimate a distribution over robot submap poses and to reject outlier hypotheses in a robust pose graph optimization step for accurate robot trajectory estimation. We demonstrate that our algorithm can localize a ground robot across a wide variety of environments, including an autonomous driving dataset, a park and campus area, and a wilderness camp, with an average optimized trajectory error of 2.4 m over 19 km of ground traversal.
comment: 9 pages, 6 figures
Attitude-Aided Linear Calibration of Triaxial Accelerometers
Triaxial MEMS accelerometers are widely used for inertial sensing, navigation, and sensor fusion, but existing calibration methods often rely on costly reference setups or nonlinear iterative optimization, limiting their efficiency and applicability to low-cost or self-calibrating systems. We present attitude-aided linear accelerometer calibration (ALAC), a method that operates on any platform providing orientation information, such as turntables, robotic arms, or inertial measurement units. ALAC constructs a combined error matrix (CEM) to represent sensor errors in a unified calibration model and enables linear least-squares estimation. The bias and gravity vector are jointly estimated, implicitly accounting for platform misalignment, and matrix decomposition of the CEM recovers scale, non-orthogonality, and alignment rotation parameters. Under static gravity, calibration is formulated as a constrained homogeneous least-squares (CHLS) problem and solved in closed form using standard linear algebra. Only five arbitrarily oriented measurements are required, and a recursive extension supports online or in-field calibration. Experiments on a stationary robot-mounted accelerometer and a quasi-static public IMU trajectory show that ALAC, in both offline and online modes, outperforms reference-based and online baselines in accuracy and robustness to sensor noise. On the same dataset, it matches iterative self-calibration under filtered conditions and surpasses all evaluated baselines on raw measurements. These results demonstrate a robust and practical calibration scheme for MEMS-based inertial platforms, especially low-cost IMUs and online calibration scenarios.
Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation
Robotic manipulation of textiles remains challenging because continuous deformation and self-occlusions hinder the robust visual perception required to estimate the cloth's state. To address the lack of annotated real-world data, we developed a Blender-based synthetic pipeline exporting auto-annotated keypoints, and combined manually labeled renders with real-world data to train a wrinkle detector. We present a perception framework integrating a CNN for permutation-invariant keypoint detection and a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm uses this system to stretch fully folded garments via wrinkles, transitioning to keypoint-based ironing once corners emerge. The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforming baselines that fail in high-occlusion states or yield false positives on severe folds.
Multi-Resolution Tactile Imitation Learning for Contact-Rich Robotic Manipulation
Touch sensing is beneficial for solving a wide variety of manipulation tasks. While there exists a wide range of tactile sensors with different properties, exploiting the fusion of multiple heterogeneous tactile sensors to improve manipulation learning remains underexplored. We present Multi-Resolution Tactile Sensing (MiTaS), a representation framework that leverages multiple tactile sensors operating at different temporal resolutions in order to solve complex contact-rich manipulation tasks. We propose a novel architecture using modality-specific convolutional stems and transformer-based fusion that effectively fuses information from an RGB camera stream, a vision-based GelSight Mini sensor and a high-frequency event-based Evetac sensor. This multi-sensor representation then conditions a flow-matching policy for solving downstream tasks. Experimental results across five contact-rich manipulation tasks demonstrate the effectiveness of multi-resolution tactile features in imitation learning. MiTaS achieves an average success rate of 80 %, while vision-only (31 %) and visual-tactile (54 %) baselines cannot solve the task reliably. Co-training a visuo-tactile model with multi-tactile data boosts performance by over 10 \% in certain tasks, without having access to the Evetac sensor during policy evaluation. A detailed sensor-reading and attention analysis reveals the importance of different sensors throughout task execution, validating our multi-resolution tactile sensing approach. Project Page: http://mitas-touch.github.io.
comment: 20 pages, preprint
RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning
Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is the most well-known downsampling operator, as its uniform coverage preserves the geometric structure on which downstream perception relies. However, the large time complexity of classical FPS scales poorly with the million-point-per-second rates of modern 3D sensors, making it a dominant latency bottleneck that conflicts with the real-time and limited onboard compute budgets of robotic systems. Therefore, we propose RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that preserves the standard FPS update rule under the same initialization and tie-breaking policy. By indexing the point cloud with spherical voxels, RadiusFPS derives a conservative geometric bound that prunes redundant distance computations in each iteration, complemented by a coordinate-wise point-skip test that removes residual updates. We further introduce RadiusFPS-G, a warp-level GPU implementation that fuses voxel selection, pruning, and distance update into memory-coalesced kernels, eliminating costly global-memory round-trips. On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G attains up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS among the evaluated methods while using roughly half its GPU memory, with comparable segmentation accuracy. When coupled with the learning-based FastPoint sampler, the resulting pipeline achieves the fastest End-to-End inference among all evaluated configurations. These properties make high-quality FPS-style sampling practical for latency- and memory-constrained robotic vision.
comment: 28 pages,15 figures
Breaking Time: A Fully Gaussian Framework for Distributed and Continuous-Time SLAM
Continuous-time SLAM provides a principled framework for fusing heterogeneous sensors while estimating smooth trajectories, and is particularly well-suited for handling heterogeneous, asynchronous sensor streams with non-uniform readout patterns, such as rolling shutter cameras, LiDAR scanners, radar sweeps, or event-based sensors. In this work, we introduce G-solver, a fully Gaussian and distributed framework that combines Gaussian Belief Propagation (GBP) with Gaussian Process (GP) motion priors for continuous-time trajectory estimation. Our GP model provides a probabilistic representation of the trajectory, enabling consistent interpolation and the use of data-driven hyperparameters, while GBP offers a scalable message-passing formulation well-suited for decentralized settings. The resulting solver naturally extends to multi-camera scenarios without specialized synchronization or engineering effort. We evaluate the approach on synthetic and real data, including rolling shutter and distributed multi-camera optimization, demonstrating accurate and stable estimation with runtimes comparable to existing continuous-time methods. An open-source implementation is released.
comment: To be published in RA-L. Open-source implementation is released at https://github.com/rvp-group/gsolver
MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action
Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes $M$ hypotheses, refines them for K weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective evaluates candidate action branches with expert-action consistency, world-model/VLM-based progress, and success feedback to align the latent path scorer with downstream execution quality. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference controls (K,M). Under matched protocols on LIBERO and CALVIN, MPCoT improves long-horizon performance, with ablations confirming depth-width effects, confidence-weighted aggregation, and reward-guided path supervision.
comment: 14 pages, 5 figures, submitted to CoRL
CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving
End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen~3.5~0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $α$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.
TAM: Torque Adaptation Module for Robust Motion Transfer in Manipulation
A policy tuned for one robot often behaves differently on another, whether due to the sim-to-real gap, unknown payloads, or the differing dynamics of two instances of the same robot. In contact-rich, dynamic manipulation, even small motion discrepancies can result in failure to track reference motion, since they disrupt the timing and modes of contact. Common remedies, such as domain randomization or system identification, either produce overly conservative task policies or require data that must be recollected for each robot or payload. We introduce the Torque Adaptation Module (TAM), a learned module that adapts the torque commands sent to the robot to match the behavior of an ideal robot. TAM operates between the low-level controller that tracks the policy's actions and the robot's torque interface. It includes a history encoder that embeds proprioceptive history into a latent state and a torque adaptor that computes residual torque corrections. Because TAM depends only on proprioceptive history and not on policy observations, or the action space, the same TAM weights can be reused to adapt policies with different action spaces (joint targets, end-effector targets, or direct torques). The policies themselves do not need to be trained with domain randomization of robot parameters. Instead, we offload the need for domain randomization to TAM by training it entirely in randomized simulation, using multi-robot pretraining followed by a robot-specific fine-tuning step that still requires no real-robot data. We evaluate TAM zero-shot on a real Franka Panda robot across dynamic manipulation tasks that include a vision-based box pushing policy (from RL), a flip policy (from BC), and an MPC ball-on-plate balancing. Our experiments show that TAM improves zero-shot real-robot execution compared to online system identification and RMA baselines and enables robust dynamic manipulation performance.
ActiveMimic: Egocentric Video Pretraining with Active Perception
Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.
comment: Project Page: https://activemimic.github.io/
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding
Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception--action mappings. To address this challenge, we propose \textbf{AffordanceVLA}, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception--action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) \textbf{Which2Act} for object-centric grounding via visual latent prediction to suppress distractions; 2) \textbf{Where2Act} for 2D interaction localization via affordance map estimation; and 3) \textbf{How2Act} for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments on simulation and real-world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
comment: Preprint. Code and project page are available. Code: https://github.com/Skywalker-yqz/AffordanceVLA Project page: https://skywalker-yqz.github.io/AffordanceVLA/
MotionDisco: Motion Discovery for Extreme Humanoid Loco-Manipulation
We present MotionDisco, a framework that discovers contact-rich, long-horizon humanoid loco-manipulation motions from scratch, without relying on teleoperation or motion retargeting from human demonstrations. This is challenging because the space of possible contact interactions grows combinatorially with the task horizon and the number of objects in the scene. MotionDisco enables rapid discovery of novel motions by coupling a large language model (LLM) guided evolutionary search over sequences of interactions with an efficient sequential kinodynamic trajectory optimizer and pruning strategy, enabling the rapid discovery of novel skills. Through extensive ablation studies, we show that our LLM-guided search discovers successful whole-body trajectories across several challenging long-horizon tasks. Finally, by training reinforcement learning tracking policies on the discovered trajectories, we transfer the motions to a real humanoid robot. This is the first work to discover and deploy long-horizon humanoid loco-manipulation skills entirely through automated evolutionary search. Supplementary videos of the experiments are available at: https://youtu.be/DHiVz34QYlw.
Towards Realistic 3D Sonar Simulation
As underwater robotics research increasingly addresses complex 3D perception and autonomous navigation, the fidelity of sonar simulation has become a key factor in algorithm development. Current simulation frameworks typically rely on geometry-driven rendering, approximating 3D sonar as an underwater equivalent to LiDAR, which fails to account for fundamental acoustic phenomena such as refraction, multi-path interference, and phase-dependent signal formation. This paper proposes a modular architecture for realistic 3D sonar simulation that integrates GPU-accelerated graphics engines with physically grounded acoustic propagation principles. We implement a volumetric 3D sonar model within the NVIDIA Isaac Sim environment, modeled after the Water Linked 3D-15 sensor, and integrate it into a comprehensive underwater simulation framework. The system is validated through a hardware-in-the-loop configuration, where a modified FastLIO2 SLAM pipeline, executed on an NVIDIA Jetson Orin Nano, performs sensor fusion using synthetic 3D sonar, DVL, IMU, and pressure data. Finally, a qualitative comparison between simulated outputs and real-world data from harbor sheet-pile inspections is provided, characterizing the remaining sim-to-real gap and establishing a roadmap toward fully acoustics-driven volumetric sensing.
3D Underwater Path Planning via Generative Flow Field Surrogates
Autonomous underwater vehicle (AUV) launch and recovery (LAR) into the hull of an advancing host platform requires traversal of a complex, three-dimensional propeller wake whose hydrodynamic structure cannot be characterised by a uniform current model. High-fidelity Reynolds-Averaged Navier-Stokes (RANS) Computational Fluid Dynamics (CFD) simulations resolve this structure with sufficient accuracy for path planning, but their computational cost renders them impractical for onboard use. We address this gap by integrating two conditional generative adversarial network (cGAN) architectures -- a regularised PatchGAN and a 2D3DGAN with self-attention -- as drop-in replacements for RANS CFD data within a three-dimensional, energy-weighted A* path planning framework. Both generators are driven by a hierarchical pipeline that synthesises full $128^3$ voxel flow field volumes from scalar operating condition inputs alone, with end-to-end inference times of approximately 28-146 $μ$s, compared to hours for a single RANS computation. We benchmark all four environmental knowledge levels: uniform current, ground-truth CFD, PatchGAN, and 2D3DGAN~SA across 19,800 independently generated trajectories spanning 550 distinct flow conditions. Full CFD wake knowledge reduces energy expenditure by 5.7-12.5% and high-velocity wake-core encounters by up to 77.8% relative to uniform-current planning, with both benefits scaling with operating severity. The cGAN surrogates recover approximately 45-60% of the CFD energy benefit and high-velocity cell avoidance benefit while operating at inference speeds compatible with edge device use. These results provide the first systematic quantification of the downstream path planning value of cGAN-predicted hydrodynamic fields in a three-dimensional maritime robotics application.
comment: 41 pages, 5 figures, 11 tables
A Conversational Framework for Human-Robot Collaborative Manipulation with Distributed Generative AI models
This paper presents a distributed conversational framework for human-robot collaborative manipulation that integrates local language and vision-language models (VLMs) with a Robot Operating System 2 (ROS 2)-based execution stack. Language understanding, visual grounding, orchestration, and motion execution run as separate ROS 2 nodes, enabling flexible deployment across distributed hardware while maintaining a responsive control loop. From free-form user commands, the system generates structured action requests for pick, place, and handover. It uses a VLM to return image-space targets, which are converted into metric robot-frame goals using depth and calibration. A web dashboard exposes intermediate intent and grounding overlays (pixel, depth, and robot-frame) and requires explicit operator confirmation before any motion is executed. Experiments on a Franka FR3 platform evaluate end-to-end task reliability and latency under increasing working table scene ambiguity and compare alternative LLM/VLM configurations in the same pipeline. Code and full documentation are available at [github.com/cogrob-tuni/franka-llm](https://github.com/cogrob-tuni/franka-llm).
comment: Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026). The final published version will appear under the title "A Distributed Conversational Framework for Human-Robot Collaborative Manipulation Using Local LLMs and VLMs"
L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation
Intra-vehicular robots in spacecraft help reduce astronaut workload and improve mission efficiency. Recent research focuses on using deep learning methods to achieve the acute control required for operations in these complex environments. However, objects exhibit unpredictable, unconstrained drift without gravitational damping. These factors demand robustness against complex multimodal action distributions. Diffusion policies (DP) can model these complex actions, but their iterative sampling process consumes too much energy for the limited power budgets of spacecraft. We therefore propose a low-energy intra-vehicular robotic manipulation framework, L-SDPPO, in which the Spiking Diffusion Policy (SDP) is optimized with a reinforcement learning (RL) algorithm. Furthermore, to address the insufficient perception of dynamic spatiotemporal features in microgravity, we propose the statedependent latency injection (SDLI) mechanism, which mimics biological neural delays to dynamically regulate the timing of input information. Evaluation on five representative intra-vehicular daily tasks (e.g., hatch opening and precision container capping) shows that our method consistently achieves higher success rates and lower energy consumption, compared to the state-of-the-art robotic manipulation methods. These results demonstrate our method is a viable intra-vehicular robotic manipulation method.
Sample-efficient Low-level Motion Planning for Robotic Manipulation Tasks via Zero-shot Transfer Learning ICANN
As robotic systems become more sophisticated, the growing complexity of their motion planning models and the longer training times pose substantial challenges. Evolutionary algorithms such as the Sample-efficient Cross-Entropy Method (iCEM) have recently demonstrated promising potential for low-level real-time planning by leveraging efficient knowledge reuse strategies to improve performance. Although effective in many control tasks, iCEM's performance can be constrained in more complex scenarios, particularly those requiring stacking, sliding, and shelf placement. In this work, we propose a novel iCEM+TL framework that explicitly leverages Transfer Learning (TL), where key iCEM parameters are transferred from simpler upstream tasks to guide more complex downstream tasks. Additionally, we applied Reward Redesign (RR) through task decomposition for stacking objects and shelf placement to optimize task-specific performance. Results from the simulation show that our framework achieves success rate improvements of up to 23%. The framework is further validated on a real Franka Emika robot in a stacking task, demonstrating its practical feasibility for real-world deployment.
comment: 12 pages, 5 figures, International Conference on Artificial Neural Networks (ICANN) 2026 conference accepted
Gotta Grow Fast: Design and Benchmarking of a Tip Mount for High-Speed Vine Robots
Soft, growing vine robots extend through tip eversion, a mechanism that enables navigation through cluttered environments. However, integrating cameras and other sensors at the tip is uniquely challenging because the material forming the tip is constantly renewed as the robot grows. This continual material turnover, combined with friction between internal layers, added tip weight, and fabric constriction, complicates sensor and tool mounting. These limitations hinder the deployment of vine robots for inspection and search tasks, where rapid growth while carrying tip-mounted sensors is essential. In this work, we present a triangular roller tip mount that reduces internal resistance during growth by rolling rather than sliding against the robot body. The design was refined through iterative failure analysis, enabling, for the first time, consistent eversion on a TPU-coated ripstop nylon vine robot. To quantitatively evaluate mount performance, we introduce a custom testbed that isolates tip mounting effects by measuring tail tension during eversion. Comparative experiments across multiple mount variants, including prior designs, show that our triangular roller mount achieves the lowest tail tension and most repeatable growth performance. These results establish both a validated tip mount design and a repeatable benchmarking framework for advancing sensor and tool integration in soft growing robots. CAD for the mount and testbed is available at: https://sprout-mitll.github.io/tip_mounts/.
comment: Accepted to IEEE Robotics & Automation Letters
RealDexUMI: A Wearable Universal Manipulation Interface for Dexterous Robot Learning
Learning dexterous manipulation requires demonstrations that preserve fine hand-object interactions while remaining executable at deployment. Existing pipelines either lose deployable dexterity through retargeting or embodiment conversion, or rely on robot-specific teleoperation that is costly to scale and often lacks intuitive, contact-aware control for dexterous data collection. We present RealDexUMI, a wearable universal manipulation interface built around a shared dexterous end-effector module that integrates a lightweight dexterous hand, in-hand vision, and fingertip tactile sensing. A palm-side isomorphic teleoperation glove maps human finger inputs to robot-hand joint commands, enabling real-time, retargeting-free, intuitive, and precise hand control. The shared hand and sensing modules yield zero-gap end-effector data, with matched in-hand observations, tactile signals, contacts, and hand actions between collection and deployment. Across eight real-robot tasks spanning fine-grained, contact-rich, long-horizon, and bimanual manipulation, policies trained on RealDexUMI data achieve an average success rate of 88.75%, generalize to unseen initial poses, and transfer across three embodiments. Website: https://research.beingbeyond.com/realdexumi
PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models
Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed up-stream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline-challenging scenes. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN-S can produce diverse cost maps, with spatially consistent variations aligned to different driving styles.
Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies
In this work, we propose a framework that combines multi-agent reinforcement learning (MARL) with model-based control to achieve safe, dynamically feasible actions in cooperative multi-agent tasks. Multi-agent reinforcement learning provides the advantage of learning cooperative policies for multi-agent teams from discrete non-differentiable rewards in a long planning horizon. Model-predictive control is robust and offers safe, dynamically feasible actions in a fast replanning framework for short horizons. We propose an algorithm that extends actor-critic model predictive control for MARL which we refer to as multi-agent actor-critic model predictive control (MA-AC-MPC). We demonstrate the capabilities of this algorithm by applying it to a multi-agent pursuit-evasion scenario. Specifically, we compare the evader team's strategy using the MA-AC-MPC model and a multi-layer perceptron model (MA-AC-MLP). The pursuer team uses augmented proportional navigation as it is accepted as an advanced adversarial control law. We also provide an example with a heterogeneous environment where a drone and omni-wheeled rover cooperate to achieve repeatable and successful landing with 100% success rate in hardware for MA-AC-MPC compared to 60% for MA-AC-MLP. We demonstrate the robustness of the proposed MA-AC-MPC algorithm in hardware for both environments.
comment: 12 pages, 8 figures, 7 tables
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the \emph{world modeling interface} to learn from extensive egocentric videos as in the world-action model (WAM) and the \emph{language reasoning} capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an \emph{autoregressive (AR)} Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the \emph{next state}, comprising the \emph{semantic-level} textual intention and complementary \emph{fine-grained} physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction \emph{implicitly} impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94\% success rate on RoboTwin2.0 Clean and 56.5\% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from \emph{cross-embodiment robot videos} without action annotations.
comment: 19 pages, 10 figures
T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation
Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to state-of-the-art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.
Towards a Data Flywheel for Embodied Intelligence in Logistics
Embodied intelligence is moving from laboratory demonstrations toward industrial deployment, with the logistics industry serving as a key application scenario. Learning-based policies offer a promising path beyond traditional perception-planning-control pipelines, but their scalability depends on how embodied data can be collected, organized, and reused. This research studies a data-centric framework for industrial embodied intelligence by constructing a logistics data flywheel. Our framework converts daily operations into reusable data assets, uses World Models to generate reliable supervision for long-tail parcel manipulation, and feeds deployment feedback back into policy improvement. As an initial result, \textit{WM-DAgger} introduces a World-Model-based data aggregation framework that synthesizes out-of-distribution recovery data for robust imitation learning. Building on this result, ongoing work explores how large-scale in-the-wild multimodal data, including labeled human demonstrations, unlabeled operational videos, and system-level robot logs, can be aligned for policy learning and transformed into feedback for continual system improvement.
Learning of Robot Safety Policies via Adversarial Synthetic Scenarios
In this work, we propose an agentic gamification framework for hazard-informed learning of robot safety policies through synthetic scenarios. We model scenario generation as an adversarial game between two agents: a Red Team that explores the space of potential failures by constructing hazardous situations, and a Blue Team that incrementally refines safety policies to prevent them. This iterative process enables efficient discovery of high-risk edge cases that are unlikely to be captured through random simulation or manual enumeration. By combining classical risk modeling with adversarial scenario generation and modern learning paradigms, this work provides a scalable pathway for embedding safety into Physical AI systems operating in complex real-world environments. The paper describes ongoing work. The contribution is a problem formulation and a proposed solution architecture.
A Novel Method with Encoder-Decoder for Cross-Sensor Adaptation in Surface Shape Sensing with Sparse Strain Sensors
Performance variations in sensor arrays, caused by intrinsic differences or installation conditions, can lead to inconsistent results during shape sensing. To obtain accurate results, a large amount of data is usually required, and a separate model must be retrained for each sensor array, thereby increasing the cost and time of data acquisition, transmission, and computation. To address this issue, this work proposes an encoder-decoder architecture for surface shape sensing based on sparse strain sensors and further incorporates meta-learning and few-shot adaptation strategies to enable adaptation across different groups of sensor arrays. Experimental results demonstrate that, after the cross-sensor adaptation, a newly deployed sensor array achieves a sensing error of approximately 4.0 mm relying on less than 5.0% newly labeled data and requiring an adaptation time of under 1 second, which represents a substantial improvement from 23.0 mm error without adaptation and 20-minute data collection time required to train a new model. Moreover, the number of points with errors below 5.0 mm increased by more than 65.0%. These results indicate that the proposed method can substantially reduce the cost and training burden of surface shape sensing, and it has broad potential applications in soft robotics and wearable devices.
TAGA: Terrain-aware Active Gaze Learning for Generalizable Agile Humanoid Locomotion
Agile humanoid locomotion across diverse challenging terrain demands both wide perceptual coverage and precise local geometry understanding. Motivated by the way humans selectively look at relevant terrain during locomotion, we introduce TAGA, a Terrain-aware Active Gaze learning framework for Attention-based humanoid control. By fusing vision, proprioception, and motion commands, our framework guides the model to learn anticipatory cues and actively attend to specific areas of the height scan, selectively using these informative regions for the downstream network. This adaptively increases the information density of observations under tight onboard computational constraints, thus enabling fine-grained perceptive locomotion over larger-scale terrains. We find that such gaze behaviors can naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance, significantly improve training efficiency. As a result, the trained policy demonstrates robust and generalizable locomotion in simulation and on hardware, including reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold traversal, and the largest reported real-world gap traversal distance of 1.2m among perceptive humanoid locomotion systems, while maintaining stability under severe perceptual disturbances and environmental interference.
LadderMan: Learning Humanoid Perceptive Ladder Climbing
Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present \textbf{LadderMan}, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io .
Visuotactile and Explicitly Force-Controlled Robotic Ultrasound for Abdominal Volumetric Reconstruction
In this paper, we present a robotic ultrasound acquisition system that integrates stereo vision, touch-based feedback, and expert-informed strategies to perform autonomous and adaptive abdominal scans. The system records freehand motion and force data from expert radiologists, creating a framework to capture transducer motion, applied forces, and anatomical scanning strategies. This expert data is replayed to replicate characteristic scans with the robot, forming a foundation for further autonomous capabilities. Using stereo vision, the system generates three-dimensional topography maps of the patient's abdomen, which are refined through stiffness measurements at key points to delineate the rib cage boundary. These combined techniques enable the robot to execute two distinct scanning paths: an upward-angled sweep beneath the rib cage to visualize structures near the upper abdomen and a perpendicular sweep across soft tissue regions. A compliant, torque-controlled seven degree-of-freedom robotic manipulator is controlled to maintain consistent probe contact through closed-loop force control over the varied anatomical surfaces. Physical experiments demonstrate that the system achieves high-quality imaging comparable to expert scans while dynamically adapting to patient-specific topographies. Furthermore, the robotic system surpasses expert capabilities by enabling three-dimensional volume acquisition, which enhances diagnostic potential and provides volumetric data for advanced analyses. This work highlights the integration of expert knowledge into autonomous robotic systems and underscores the potential of combining perception-based autonomy with physical reasoning for enhanced diagnostic performance.
Amortized Nonlinear Model Predictive Control
Nonlinear Model Predictive Control requires solving a constrained nonlinear program (NLP) in real-time at every sampling instant, a computational bottleneck that limits deployment on resource-constrained hardware or at high sampling rates. We address this challenge for the broad class of input-affine nonlinear systems to show that the optimal control move can be approximated by a state-dependent quadratic program (QP) whose cost parameters depend on the current state and reference. We propose a single-network residual-corrector architecture: a state-dependent analytic baseline provides initial QP parameters, and the network learns only the corrections needed to match the full NLP solution; the QP is solved by a differentiable interior-point layer, guaranteeing constraint satisfaction for the first control action. The network is trained offline on data generated by an NLP solver using a hybrid loss that combines supervised imitation and KKT-residual penalties. We validate the approach on a three-link planar robotic arm with Cartesian end-effector tracking, demonstrating orders-of-magnitude speedup over the NLP solver while maintaining comparable tracking performance.
comment: 6 pages
PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation
Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models
Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6\% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
comment: 20 pages, 10 figures
DexFuture: Hierarchical Future-State Visuomotor Targeting for Bimanual Dexterous Tool Use
Bimanual dexterous tool use remains challenging for robots due to high-dimensional hand configurations and complex hand-tool-object dynamics and contact. Most existing control policies depend on future configuration references provided from demonstrations, while future action-conditioned world models require slow online planning over high-dimensional action sequences. A significant challenge is generating a dynamically consistent future reference trajectory without relying on privileged states from demonstrations or slow counterfactual planning. We propose DexFuture, a hierarchical system that couples a high-level Future-State Visuomotor Target Predictor with a low-level Target-Conditioned Structured Dexterous Policy. Conditioned on egocentric RGB, proprioceptive and geometric history, the high-level predictor constructs structured hand-tool-object visuomotor embeddings and uses a horizon-conditioned transformer to generate a multi-step future target trajectory. Then, the low-level policy tracks them with a target-conditioned per-link transformer. This hierarchy decouples coarse future reference generation from fine-grained action control, and slow long-horizon semantic prediction from high-frequency execution. On OakInk2 bimanual tool-use tasks, DexFuture achieves 90% of the privileged-oracle performance, compared to 7% for a no-reference policy. DexFuture operates at 60 Hz, approximately 250 times faster than DexWM-style Cross-Entropy Method (CEM) planning with a future action-conditioned world model.
Accelerating and Scaling MPC-Guided Reinforcement Learning for Humanoid Locomotion and Manipulation
In humanoid motion control, model predictive control (MPC) offers physically grounded prediction and constraint handling, while reinforcement learning (RL) enables robust whole-body skills through large-scale simulation. However, using MPC inside RL often requires time-consuming problem construction or excessive training overhead, making such frameworks difficult to justify in practice. This work studies efficient training-time MPC guidance for humanoid locomotion and manipulation, termed MPC-RL. We introduce a centroidal-dynamics MPC reward formulation that leverages guidance from MPC trajectories in training time. To make this practical in massively parallel RL, we develop $π^n$MPC, a parallel-in-horizon and construction-free batched GPU MPC solver that operates directly on time-varying dynamics to avoid high memory usage and pre-compilation. Through a variety of comparative studies and hardware validations, we have found that MPC-RL achieves superior performance in locomotion and manipulation skills. The code base is available at https://github.com/junhengl/mpc-rl.
comment: 8 pages, 5 figures
Dynamic Multi-Agent Pickup and Delivery in Robotic Cellular Warehousing Systems
Robotic Cellular Warehousing Systems (RCWS) give rise to multi-agent pickup and delivery (MAPD) processes in which robots sequentially collect multiple stock-keeping units (SKUs) for each order. Unlike classical MAPD formulations that assume static tasks, real warehouse operations often involve dynamic order evolution, where new SKUs may be appended to an order while it is being executed. Motivated by this practical requirement, this letter formulates the Dynamic Multi-Agent Pickup and Delivery problem considering internal order evolution for the first time. Building on the token passing paradigm, we propose two event-triggered online replanning algorithms. The first, Dynamic Token Passing, performs localized replanning upon order updates through add-order decomposition and priority-based token scheduling while preserving collision-free execution. The second, Cooperative Token Passing, further enables idle robots to opportunistically assist newly added pickups, improving system-level efficiency. Simulation results in RCWS environments demonstrate that the proposed methods significantly reduce order flowtime compared with static and non-cooperative baselines.
Preserving Full 6-DOF Actuation Under Abrupt Total Rotor Failures: Passive Fault-Tolerant Flight Control Using a Biaxial-Tilt Hexacopter
Conventional multirotors suffer from a rapid collapse of attainable wrench space (AWS) under abrupt total rotor failures, rendering full 6-DOF recovery physically impossible. This paper addresses passive fault-tolerant flight of a biaxial-tilt overactuated hexacopter (BTO) under abrupt total rotor failures that are a priori unknown to the controller. The control design and analysis focus on representative abrupt rotor-failure cases for which the post-failure system remains fully actuated, while no explicit fault detection, isolation, or fault-mode switching is assumed. First, we extend the inscribed-sphere metric of the AWS by incorporating the transient-wrench-jump term, enabling quantitative feasibility assessment under up to three simultaneous rotor failures and benchmarking against uniaxial-tilt and coplanar hexacopters. Second, we develop two computationally efficient passive schemes without relying on fault detection or online optimization. One scheme operates at the controller layer by combining a high-order fully actuated (HOFA) controller with a linear extended state observer (LESO) for lumped-disturbance rejection. The other scheme operates at the allocator layer by using model-reference adaptive control allocation with momentum-based wrench estimation to compensate for control-allocation biases. Simulations and flight experiments validate stable hovering and 6-DOF trajectory tracking under single and multiple rotor failures. Further systematic comparisons confirm that the BTO provides larger recovery margins than uniaxial-tilt and coplanar designs. Additional onboard-sensor-only experiments, including indoor tracking under wind disturbance, outdoor tracking under extreme conditions, narrow-frame traversal, and contact-based aerial writing, further validate the robustness of the proposed framework in complex operational environments.
Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation
Embodied AI systems are increasingly expected to reason and act over extended horizons in physical environments. This growing capability brings safety to the foreground, because failures in the physical world can harm people, damage objects, and disrupt workplaces. Although safe embodied AI has attracted substantial attention, the literature remains fragmented across planning, policy design, and runtime execution. Long-horizon robotic manipulation is a particularly revealing anchor domain for this problem because semantic misgrounding, subtask-level error propagation, execution drift, and contact-rich physical risk can accumulate within the same closed-loop system. This survey therefore provides a structured review of safety in long-horizon robotic manipulation from an embodied AI perspective. We organize the literature by intervention locus, covering planning-time, policy-time, and execution-time safety, and we analyze the strength of the evidence that each line of work provides, distinguishing formal guarantees, statistical support, and empirical safety heuristics. This framework clarifies the distinct roles of backbone capability papers, direct safety mechanisms, and benchmark or evaluation studies, while exposing where current safety claims are well supported and where they remain indirect. We identify persistent gaps, including limited evidence for policy-time safety, weak formal support for contact-rich long-horizon manipulation, immature uncertainty-triggered intervention, and a shortage of manipulation-specific safety benchmarks. We conclude by outlining research directions for cross-layer assurance, evaluation design, and safer deployment of long-horizon robotic agents in real-world settings.
comment: 63 pages, 6 figures
Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning
Autonomous driving requires reasoning about how ego actions shape the evolution of the surrounding world. However, most end-to-end methods rely on direct state-to-action mappings, capturing correlations without explicitly modeling action-conditioned dynamics. Conversely, continuous-latent world models often lack compositional structure for causal reasoning across counterfactual futures. We introduce Discrete-WAM, a unified latent vision-action world policy that represents future visual states and ego actions as aligned discrete tokens, enabling compositional causal reasoning across alternative futures. Built upon this unified discrete alignment, Discrete-WAM establishes a shared discrete diffusion framework with unified generative tasks, jointly formulating world modeling, world-action policy, and hierarchical decision-enabled policy, supporting compositional generalization across diverse driving scenarios. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves competitive performance while supporting controllable generation and counterfactual reasoning, offering a principled path toward more reliable decision-making.
Auditing Demonstration Curation Metrics: Action-Only Scorers Fail on the Structural Defects That Degrade Imitation Policies
Imitation-learning policies inherit the quality of the demonstrations they are trained on, and a growing set of curation metrics promise to score and filter low-quality demonstrations automatically. These metrics are each validated on different data with different protocols, so it is unclear which of them actually identify the demonstrations that harm a policy. We build a controlled testbed in which demonstration defects are injected with known type, and audit seven curation metrics along two axes: how well each separates defective from clean demonstrations, and whether training a behavior-cloning policy on each metric's curated subset improves task success. We study two defect regimes. Subtle perturbations (correlated action noise, tremor, truncation) are detectable by multivariate outlier scoring and, once removed, recover the full downstream gap. Structural errors, where the demonstration executes a wrong action at a key moment, are invisible to every action-only metric we test, and two of them are inverted: they score defective demonstrations as higher quality and, used for curation, tend to leave the policy at or below the uncurated baseline rather than above it. Only metrics that examine the state trajectory detect structural errors, and even the best of them recovers just a third of the downstream gap. High detection accuracy does not guarantee downstream improvement. We release the testbed and all curation implementations.
comment: 5 pages, 3 figures, 4 tables
Wave Focusing in Metamaterials: Tactile Displays Beyond the Diffraction Limit
We address the challenge of engineering distributed haptic displays capable of reproducing multiple localized, independently addressable vibrations -- representing virtual tactile pixels -- at arbitrary locations on a surface. Our technique is based on the focusing of mechanical waves in a flexural plate using a sparse set of actuators. At tactile frequencies, wave diffraction prevents the formation of localized virtual tactile pixels at spatial scales relevant for multi-digit touch interactions. We overcome this limitation by augmenting the plate with a lattice of mechanical resonators, forming a locally resonant metamaterial plate. Coupling between the plate's dynamic modes and those of the resonators alters the dispersion relation governing wave transmission, introducing a slow-wave branch that enables focusing beyond the diffraction limit imposed by the unmodified plate. We use numerical simulations to engineer the dispersion relation of the metamaterial system for high-resolution focusing at tactile frequencies. We then fabricate a metamaterial tactile display and experimentally demonstrate virtual pixels that are far more localized than those generated on an otherwise identical plate without resonators, resulting in a tenfold reduction in virtual-pixel area. In behavioral experiments, we show that this system can deliver perceptually localized single- and multi-point tactile feedback and moving tactile sources while maintaining independent control over temporal waveforms at multiple display locations. The methods reported here can enable high-resolution haptic displays for widespread applications using a small number of actuated degrees of freedom.
What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning
Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is "movable"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., "movable"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances outperforming state-of-the-art approaches by over 15% points, improves new-affordance inference accuracy from 70% to over 90% with fewer than 10% of the original training data, and enables 100x faster inference. Code, videos, and data available at: https://A4Dance-reasoning.github.io.
comment: Code, videos, and data available at: https://A4Dance-reasoning.github.io
Multi-Robot Planning and Control from CCTV Camera Networks in a Real Warehouse
Off-board control of mobile robots from cameras embedded in the environment offers a practical path to scalable autonomy, moving sensing and compute off the robots. We extend this idea from the single-robot case to coordinated fleets in a real warehouse, driving multiple robots with only a distributed CCTV network and edge compute. The system operates entirely in image space over an uncalibrated, pixel-wise topological camera graph, enabling wide-area operation with flexible camera placement. A hierarchical planner selects a camera sequence per robot and plans its image-space motion through each view, coordinating robots with a prioritised-then-joint strategy and treating overlapping camera regions as shared resources held by one robot at a time to prevent collisions and deadlocks. We validate the approach in a real warehouse with four robots and 30 cameras across six 27 m aisles, reporting mission times and coordination statistics. To our knowledge, this is the first field demonstration of multi-robot planning and coordination using only an external camera network and off-board compute, with robots carrying no task-specific navigation hardware.
AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation
Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene understanding, yet often fail to reliably execute correct low-level actions under distribution shifts. For example, even in a simple pickup task with identical scene layouts, camera viewpoints, and illumination, performance can degrade substantially when the object is placed at unseen locations. We argue that this gap arises from insufficient action understanding, namely the inability to interpret the robot's base-frame action coordinate system in image space. To address this issue, we introduce AxisGuide, a lightweight guidance method that bridges semantic scene understanding and action-coordinate interpretation. Using camera parameters and end-effector poses, AxisGuide renders the robot base-frame axes in each camera view and augments RGB observations with a small set of cue channels that explicitly visualize the meaning of the +x, +y, and +z motions in image space. Extensive evaluations in both the LIBERO simulation and real-world environments demonstrate that AxisGuide yields substantial performance gains and improved generalization, highlighting the effectiveness of explicit action-coordinate cues for learning reliable and transferable generalist visuomotor policies.
comment: Accepted to Robotics: Science and Systems (RSS) 2026
IDDMBSE: Integrating Data-Driven and Model-Based Systems Engineering for Trusted Autonomous Cyber-Physical Systems
Autonomous cyber-physical systems (CPS) sit at the intersection of Model-Based Systems Engineering (MBSE) and data-driven Machine Learning and Artificial Intelligence (ML/AI), yet no integrated Systems Engineering (SE) methodology natively spans both. We address this gap with IDDMBSE, an Integrated Data-Driven and Model-Based Systems Engineering methodology that extends the rigorous MBSE V-process with a data-driven loop at every step, anchored in SysML, the autonomy stack, and a hybrid model-based plus data-driven trade-off architecture. We instantiate IDDMBSE as an interoperable, open-source tool chain: PERFECT, which maps SysML system architectures to executable ROS autonomy stacks for scalable performance evaluation; TRADES-X, which decomposes design-space exploration into a model-based optimization stage followed by a data-driven evaluation stage; and VERITAS, which combines formal, data-driven, and runtime verification into a single assurance workflow. We demonstrate IDDMBSE on a Trusted Autonomous Ground Robot across its development lifecycle, spanning sensor-suite selection, risk-sensitive path planning, behavior-tree task verification, conformal-prediction-based robust perception, and assured multi-robot coordination, all exercised in a contested-terrain Isaac Sim test range that we release with the tool chain. We close by sketching how IDDMBSE is being re-formulated on SysML v2 / KerML foundations to enable language-native composability and tighter ML/AI integration.
comment: 9 pages, 11 figures. This work has been submitted to the IEEE for possible publication
SCOUT: Semantic scene COverage via Uncertainty-guided Traversal ICRA
Robots that operate over extended periods should not merely visit space; they should progressively understand it. Yet most 3D scene graph pipelines treat perception as a post-processing stage over a fixed dataset, decoupling scene representation from the decisions that determine what is observed in the first place. We present SCOUT, an online semantic exploration framework that closes this loop by coupling active traversal with probabilistic scene graph construction. Given a prior 2D occupancy map and posed RGB-D observations, SCOUT incrementally builds an uncertainty-aware 3D scene graph whose nodes maintain fused geometry and posterior beliefs over open-vocabulary object labels, while edges encode structural relations such as on, inside, belong, and next to. These beliefs are fed back to an uncertainty-guided traversal planner, which selects viewpoints by balancing expected semantic certainty gain, geometric coverage gain, and travel cost. In this way, the robot revisits ambiguous objects when additional evidence matters and expands into unseen free space when the scene remains incomplete. The resulting system treats semantic scene completeness as an operational objective rather than a passive by-product of semantic mapping, moving toward autonomous agents that can patrol, update, and reason about evolving indoor environments with minimal human intervention.
comment: 2026 ICRA Workshop on Uncertainty in Open World Robotics
Optimal Control Approach for Non-prehensile Ball Juggling Using a 7-DoF Manipulator ICRA 2026
Non-prehensile object manipulation skills are important for real-world robot interactions, enabling highly dynamic tasks such as balancing a glass on a tray or the controlled sliding of items on a table. Among such tasks, those characterised by high-speed manipulation requirements and general sensitivity of the resulting hybrid dynamics are particularly hard to accomplish. Within these, juggling can be seen as a highly challenging maneuver to be solved. The key to robotic juggling is achieving dynamic stabilisation of an underactuated object. Since the object does not possess the ability of self-correction, its stability is entirely dependent on the forces applied to it. This creates a system that is sensitive to control inputs, where timing is critical to continuously counteract deviations and maintain the desired behavior. We develop a systematic method to control a 7-degree-of-freedom manipulator performing non-prehensile ball juggling with a tool. Our primary contribution is a model-based framework for generating juggling trajectories and stabilizing a periodic juggling motion for this hybrid system. The framework incorporates a two-stage optimal control approach to compute the underlying feasible motion patterns required for stable juggling. Offline-computed trajectories are then organised to enable real-time error correction without solving optimal control problems online. We demonstrate the effectiveness of the resulting controller by first evaluating its performance in a simulation environment and performing an experiment using a Franka Emika Panda robot.
comment: 8 pages, accepted at ICRA 2026
On the Hardness of Optimal Motion on Trees
This paper presents a simple framework that settles the complexity of Multi-Agent Path Finding (MAPF) on trees across standard objectives--distance, makespan, and flowtime--for both labeled and colored variants. In MAPF, agents occupy the vertices of a graph and must move to target vertices without collisions while optimizing a given objective. In the labeled case, the agents are distinct and have respective targets; in the colored case, agents of the same color are interchangeable. While many MAPF variants are known to be intractable, several basic cases on trees have remained open. We prove NP-hardness on trees for both labeled and 2-colored MAPF under all three objectives. In particular, we resolve the classical Pebble Motion problem, where one pebble moves at a time to an adjacent empty vertex and the goal is to minimize the total number of moves. Despite being one of the most basic discrete motion models, its complexity on trees had remained open for several decades. Moreover, for colored Pebble Motion, we give the first hardness result on any graph class, already with two colors, which is tight. All of these results are established through the hardness of Stack Rearrangement, itself posed as an open problem, which asks to optimally rearrange items stored in stacks, and which we also prove to be NP-hard. Notably, the connection to stacks yields hardness already on very simple trees--subdivided stars--across all problems. Together, these results reveal a common tractability barrier that permeates several fundamental motion models, thereby unifying and strengthening prior hardness results.
AEGIS: A Backup Reflex for Physical AI
Long-horizon robot manipulation tends to fail gradually: one bad step degrades the state, and the policy spirals into a basin from which it cannot recover. The failure is often visible before it happens. We introduce AEGIS (Activation-probe Early-warning, Gated Inference Switching), a selective escalation method that uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps while there is still time to act. When the probe flags a step, control switches to a stronger separate policy, but only for the steps that need it. On LIBERO-Spatial, AEGIS recovers 10.1% of the trajectories the weak policy alone loses, versus 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo. These gains are significant under one-sided exact paired McNemar tests with Holm-Bonferroni adjustment over three pre-registered contrasts: +5.4pp over blind escalation, p=8.5e-6; +5.0pp over random triggering, p=1.0e-4; paired-trajectory bootstrap CIs exclude zero. AEGIS activates the stronger policy on only 38% of steps, so the lever is timing rather than compute. The probe clears its precondition with an early-window AUROC of 0.764, 95% CI [0.70, 0.84], read from the weak-policy path over the first 30% of trajectory steps before any handoff. We pre-register the full analysis plan, including a conditional recovered-task-rate estimand and explicit kill criteria, and confirm the result on 700 common-random-number episodes per arm, with nA-fail=646.
What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?
Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of $29.7\%$ in the low-robot-data regime across six manipulation tasks.
comment: The project website is here: https://richardrl.github.io/what-matters-cotraining-human-videos/index.html
ChronoForest: Closed-Loop Multi-Tree Diffusion Planning for Efficient Bridge Search and Route Composition
How can we plan long-horizon routes that reach designated goals, visit required waypoints, and remain short when only short-horizon offline trajectories are available? This problem matters in offline navigation because collecting sufficiently rich long-horizon data is difficult, yet real agents must still solve long-range tasks with route-level efficiency rather than mere feasibility. The difficulty is twofold: at the microscopic level, composing many short-horizon segments creates a trade-off between search cost and path quality, while at the macroscopic level, waypoint ordering requires comparing pairwise travel costs among start, goal, and waypoint anchors that are unknown before planning and increasingly unreliable when estimated only from long-range temporal distance. In this paper, we propose ChronoForest, a closed-loop planning system that couples local bridge search and online route re-solving through an anchor-chaining tree diffusion planner and an online multi-tree orchestrator. ChronoForest uses temporal distance for short-range guidance and node evaluation, while using search-time bridge evidence to validate long-range anchor connectivity and repeatedly re-solve the route. On OGBench AntMaze-Stitch, ChronoForest achieves 99.8%, 99.3%, and 99.5% success on the medium, large, and giant splits and improves giant-stitch success by up to 34.5 points over prior reported diffusion-based results. On Hamiltonian route-composition benchmarks, online re-solving corrects poor temporal orderings and improves route quality while remaining substantially cheaper than exhaustive planning.
comment: 40 pages, 4 figures, 7 tables, 3 algorithms
PhyRoGen: Synthetic Generation of Physical Robot Manipulation Puzzles Using Procedural Content Generation
Robot manipulation of physical puzzles is important for automatic assembly and disassembly tasks. However, to enable robots to solve physical puzzles, manipulation skills need to be learned, which requires large training datasets, the generation of which is often time consuming and tedious. To overcome this problem, we propose the Physical Robot Manipulation Puzzle Generation framework (PhyRoGen), which leverages procedural content generation (PCG) for automated generation of synthetic datasets of manipulation puzzles. PhyRoGen is a general-purpose puzzle generator, which can generate physical puzzles with interlocking object dependencies, where one articulated object must be manipulated before another can be moved. Based upon PhyRoGen, we define six concrete generators which we use to generate 24 physical puzzles. By using a benchmarking framework, we are able to solve all puzzles in 1 to 300 seconds using sampling-based planning algorithms. Finally, we demonstrate that every generated puzzle is manipulatable by using a KUKA LBR iiwa robot in a physical simulation. This shows that our framework is able to procedurally generate unique, solvable robot manipulation puzzles, which is a crucial ingredient to benchmark manipulation algorithms and to develop robust foundation models.
comment: 8 pages, accepted at CASE 2026
Robots Need More than VLA and World Models
Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.
From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution
In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot's true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner's initial first-order trajectory and the dynamics required for real execution.
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 50 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.
comment: Project website: https://open-h.github.io/open-h-embodiment/
PHUMA: Physically Reliable Humanoid Locomotion Dataset
Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often suffer from physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. To address this, we introduce PHUMA, a Physically Reliable HUMAnoid locomotion dataset produced by a two-stage pipeline combining physics-aware curation and physics-constrained retargeting, aggregating both motion capture and internet video into a physically reliable, 73-hour corpus. On motion tracking benchmarks, PHUMA-trained policies achieve higher success rates than those trained on AMASS and Humanoid-X, and successfully transfer zero-shot to a real Unitree G1. The code is available at https://davian-robotics.github.io/PHUMA.
Learning Predictive Visuomotor Coordination CVPR 2026
Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a \textit{Visuomotor Coordination Representation} (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling. Project Page: https://vjwq.github.io/VCR/.
comment: CVPR 2026 Findings
SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation
Vision-Language Navigation (VLN) approaches have currently followed two primary paradigms: the end-to-end Vision-Language Model (VLM) policy fine-tuned on navigation trajectories to directly predict actions, and the zero-shot modular pipeline integrating pre-trained Multimodal Large Language Model (MLLM) for training-free generalization to unseen environments. However, end-to-end methods struggle with long-horizon navigation and lack dynamic reasoning, whereas zero-shot methods are constrained by limited spatial grounding for reliable planning and also require substantial reasoning time. To bridge this gap, we introduce SEDualVLN, a spatially-enhanced dual-system VLN framework. System 1 is a VLM model enhanced with both global and local spatial awareness, used for action generation. System 2 integrates a general MLLM with a mapping module, wherein the MLLM plans waypoints by leveraging top-down views of the real-time 3D map alongside streams of rendered path images. Both systems leverage different forms of spatial enhancement to cultivate the agent's sense of direction in VLN tasks. Ultimately, they cooperate to complete the navigation task through a fast-slow coordinated approach. SEDualVLN achieves state-of-the-art performance on VLN-CE benchmarks, and further ablation studies demonstrate the effectiveness of each system and module.
OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics
We present OSCAR, a precise action-conditioned video world model that generalizes across different robot embodiments and enables robot policy evaluation. Existing video world models face three main challenges for real-world robot evaluation: limited scenario diversity in current robot training datasets, imprecise action following, and poor generalization across embodiments for broad adoption. We tackle these challenges from two perspectives. At its core is a large-scale standardized data pipeline that curates, filters, and deduplicates broad robotics and egocentric human datasets, yielding a clean joint-training dataset that spans diverse tasks, scenarios, actions, and robot embodiments. To condition the video model, we adopt 2D kinematic skeleton rendering as a unified conditioning representation that generalizes across different robot arms or even human hands. We finetune the Cosmos-Predict2.5-2B model on a single GH200 GPU. Our model achieves significant improvement on action following, appearance quality, and motion consistency, compared to existing baselines, which either have a much larger model size or require more GPUs. We further deploy OSCAR to evaluate robot policies from RoboArena. Extensive experiments demonstrate the significant correlation between our virtual policy evaluation in OSCAR and real-world evaluation, paving the way for the future where robot policies can be purely evaluated in virtual generated worlds.
comment: Project page: https://wuzy2115.github.io/oscar-project-page/
Is Diversity All You Need for Scalable Robotic Manipulation?
Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions-task (what to do), embodiment (which robot to use), and expert (who demonstrates)-challenging the conventional intuition of "more diverse is better". Throughout extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer-models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing more desirable scaling property during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity, the yielding GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.
comment: Code is available at https://github.com/OpenDriveLab/AgiBot-World
EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration
Human demonstrations offer rich environmental diversity and scale naturally, making them an appealing alternative to robot teleoperation. While this paradigm has advanced robot-arm manipulation, its potential for the more challenging, data-hungry problem of humanoid loco-manipulation remains largely unexplored. We present EgoHumanoid, the first framework to co-train a vision-language-action policy using abundant egocentric human demonstrations together with a limited amount of robot data, enabling humanoids to perform loco-manipulation across diverse real-world environments. To bridge the embodiment gap between humans and robots, including discrepancies in physical morphology and viewpoint, we introduce a systematic alignment pipeline spanning from hardware design to data processing. A portable system for scalable human data collection is developed, and we establish practical collection protocols to improve transferability. At the core of our human-to-humanoid alignment pipeline lies two key components. The view alignment reduces visual domain discrepancies caused by camera height and perspective variation. The action alignment maps human motions into a unified, kinematically feasible action space for humanoid control. Extensive real-world experiments demonstrate that incorporating robot-free egocentric data significantly outperforms robot-only baselines by 51\%, particularly in unseen environments. Our analysis further reveals which behaviors transfer effectively and the potential for scaling human data.
comment: Project page: https://opendrivelab.com/EgoHumanoid
Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models
Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited. In this paper, we propose an RL-based sim-real Co-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate our framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and $π_{0.5}$, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on $π_{0.5}$. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially improved real-world data efficiency, providing a practical and scalable pathway for leveraging simulation to enhance real-robot deployment.
Enhancing Multi-Robot Exploration Using Probabilistic Frontier Prioritization with Dirichlet Process Gaussian Mixtures
Multi-agent autonomous exploration is essential for applications such as environmental monitoring, search and rescue, and industrial-scale surveillance. However, effective coordination under communication constraints remains a significant challenge. Frontier exploration algorithms analyze the boundary between the known and unknown regions to determine the next-best view that maximizes exploratory gain. This article proposes an enhancement to existing frontier-based exploration algorithms by introducing a probabilistic approach to frontier prioritization. By leveraging Dirichlet process Gaussian mixture model (DP-GMM) and a probabilistic formulation of information gain, the method improves the quality of frontier prioritization. The proposed enhancement, integrated into two state-of-the-art multi-agent exploration algorithms, consistently improves performance across environments of varying clutter, communication constraints, and team sizes. Simulations showcase an average gain of $10\%$ and $14\%$ for the two algorithms across all combinations. Successful deployment in real-world experiments with a dual-drone system further corroborates these findings.
comment: Accepted: IEEE Robotics and Automation Letters (RA-L)
Simulation of Adaptive Running with Flexible Sports Prosthesis using Reinforcement Learning of Hybrid-link System
This study proposes a reinforcement learning-based framework for adaptive running motion simulation in a unilateral transtibial amputee using a hybrid-link system that incorporates the flexibility of a leaf-spring-type sports prosthesis. The design and selection of sports prostheses typically rely on trial and error. A comprehensive whole-body dynamics analysis that accounts for interactions between human motion and prosthetic deformation can provide valuable insights for user-specific design and selection. The proposed hybrid-link system enables such analysis by integrating a Piece-wise Constant Strain (PCS) model to represent prosthetic flexibility. Based on this system, the simulation methodology generates whole-body dynamic motions of a unilateral transtibial amputee using a reinforcement learning approach. This framework integrates imitation learning based on motion capture data with accurate computation of prosthetic dynamics. Running motions are simulated under multiple virtual prosthetic stiffness conditions, and the corresponding metabolic cost of transport (COT) obtained from these simulations is analyzed. The results suggest that variations in prosthetic stiffness influence running dynamics and performance, and that COT is consistent with values reported in prior study. Our findings demonstrate the potential of the proposed approach for simulation and analysis under virtual conditions that differ from real-world conditions.
Test-Time Training for Visual Foresight Vision-Language-Action Models ICML 2026
Visual Foresight VLA (VF-VLA) has become a prominent architectural choice in the recent VLA due to its impressive performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribution (OOD) shifts. Because the quality of action directly depends on the accuracy of the predicted future visual information, OOD conditions affect both stages at once. To address this vulnerability, we propose Test-Time Training Visual Foresight VLA ($T^3$VF), a test-time training approach motivated by the observation that the predicted future image and its subsequent observation form a natural supervision pair. To further address the practical challenges that arise from indiscriminate test-time updates, we introduce an adaptive update filtering mechanism. Empirically, $T^3$VF mitigates the OOD vulnerability of VF-VLA at a modest additional inference cost, without requiring any architectural modification or auxiliary modules.
comment: Accepted at ICML 2026 Workshop on Continual Adaptation at Scale (CATS)
VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training
Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address the challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i)~UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii)~A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii)~A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments empirically show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $π_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.
comment: Corrected the typing error
Do We Really Need Immediate Resets? Rethinking Collision Handling for Efficient Robot Navigation
Should a single collision necessarily terminate an entire navigation episode? In most deep reinforcement learning (DRL) frameworks for robot navigation, this remains the standard practice: every collision immediately triggers a global environment reset and is penalized as a complete task failure. While a collision during deployment naturally indicates task failure, applying the same treatment during training prevents the agent from exploring challenging obstacle configurations, which slows learning progress in the early training phase. In this work, we challenge this convention and propose a Multi-Collision reset Budget (MCB) framework that decouples local collision termination from global environment resets, allowing the agent to retry difficult configurations within the same episode. Simulation experiments show that MCB improves early-stage learning efficiency by reaching target success-rate levels with fewer interactions, with small collision budgets producing the most consistent gains. Real-world experiments on heterogeneous robot platforms further validate the deployability of the learned policies in cluttered environments.
comment: 8 pages, 9 figures
ContactExplorer: Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation
Reinforcement learning has achieved remarkable success in domains such as Atari games, navigation, and locomotion, where exploration can often be guided by novelty over states or dynamics. In contrast, dexterous manipulation requires rich physical hand--object interactions, but existing methods often suffer from unstable contact-based novelty signals, inefficient distance novelty signals, or reliance on task-specific priors. We propose ContactExplorer, a general exploration method for dexterous manipulation tasks. ContactExplorer represents contact as the intersection between object surface points and hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate ContactExplorer on a diverse set of dexterous manipulation tasks. Experimental results show that ContactExplorer substantially improves sample efficiency and success rates over existing exploration methods, and that the contact patterns learned with ContactExplorer transfer robustly to the real world. Project page is https://contact-explorer.github.io.
comment: 24 pages
ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model CVPR 2026
Grasping in a densely cluttered environment is a challenging task for robots. Previous methods tried to solve this problem by actively gathering multiple views before grasp pose generation. However, they either overlooked the importance of the grasp distribution for information gain estimation or relied on the projection of the grasp distribution, which ignores the structure of grasp poses on the SE(3) manifold. To tackle these challenges, we propose a calibrated energy-based model for grasp pose generation and an active view selection method that estimates information gain from grasp distribution. Our energy-based model captures the multi-modality nature of grasp distribution on the SE(3) manifold. The energy level is calibrated to the success rate of grasps so that the predicted distribution aligns with the real distribution. The next best view is selected by estimating the information gain for grasp from the calibrated distribution conditioned on the reconstructed environment, which could efficiently drive the robot to explore affordable parts of the target object. Experiments on simulated environments and real robot setups demonstrate that our model could successfully grasp objects in a cluttered environment with limited view budgets compared to previous state-of-the-art models. Our simulated environment can serve as a reproducible platform for future research on active grasping. The source code of our paper will be made public when the paper is released to the public.
comment: CVPR 2026
ScenicRules: An Autonomous Driving Benchmark with Multi-Objective Specifications and Abstract Scenarios
Developing autonomous driving systems for complex traffic environments requires balancing multiple objectives, such as avoiding collisions, obeying traffic rules, and making efficient progress. In many situations, these objectives cannot be satisfied simultaneously, and explicit priority relations naturally arise. Also, driving rules require context, so it is important to formally model the environment scenarios within which such rules apply. Existing benchmarks for evaluating autonomous vehicles lack such combinations of multi-objective prioritized rules and formal environment models. In this work, we introduce ScenicRules, a benchmark for evaluating autonomous driving systems in stochastic environments under prioritized multi-objective specifications. We first formalize a diverse set of objectives to serve as quantitative evaluation metrics. Next, we design a Hierarchical Rulebook framework that encodes multiple objectives and their priority relations in an interpretable and adaptable manner. We then construct a compact yet representative collection of scenarios spanning diverse driving contexts and near-accident situations, formally modeled in the Scenic language. Experimental results show that our formalized objectives and Hierarchical Rulebooks align well with human driving judgments and that our benchmark effectively exposes agent failures with respect to the prioritized objectives. Our benchmark can be accessed at https://github.com/BerkeleyLearnVerify/ScenicRules/.
comment: v2: Minor numerical corrections for Table V. 16 pages, 14 figures, 7 tables. Extended version of paper accepted to 2026 IEEE Intelligent Vehicles Symposium (IV 2026). ScenicRules benchmark available at https://github.com/BerkeleyLearnVerify/ScenicRules
Safety by Invariance, Liveness through Refinement: Heterogeneous Contract Framework for Co-Design of Layered Control
Real-world control systems must achieve long-horizon objectives (liveness) while respecting continuous-time safety constraints, a combination that motivates hierarchical layered control architectures (LCAs). Existing LCA research, however, lacks (i) a uniform specification language across discrete planning and continuous execution, (ii) formal guarantees that specifications are preserved when interconnecting subsystems at heterogeneous time scales, and (iii) compositional separation between layers, owing to reliance on naive input-filtering laws. This paper addresses all three gaps by importing the safety--liveness decomposition into a heterogeneous assume--guarantee framework: \emph{safety is enforced by invariance} at the continuous-time layer, while \emph{liveness is achieved through refinement} at the discrete-time layer, with inter-layer coordination formalized via vertical refinement and timing-compatibility conditions. We instantiate this contract with a novel LCA combining an MPC planner, an input-to-state stabilizing (ISS) low-level controller, and a reference-governor bridge, and validate it on a Hybrid Energy Storage System (HESS) comprising a battery and a supercapacitor.
comment: 21 pages
Multiagent Systems
Unsupervised Skill Discovery for Agentic Data Analysis
Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or aggreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks respectively.
comment: Work in progress
Emergent Language as an Approach to Conscious AI
The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from minimal (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.
comment: Source codes available at https://github.com/wuzengqing001225/ConsciousAI_Indexicality/
From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws
LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which harness layer causes the unreliable behavior, resulting in broad, indirect, or poorly scoped changes. This paper proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It then attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications to reduce target flaws without introducing unacceptable regressions. We evaluate HarnessFix on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA and AppWorld. Across these benchmarks, HarnessFix improves held-out test performance over the initial harnesses by 15.2%--50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across ETCLOVG layers.
DAST: A VLM-LLM Framework for Cross-Interface Anomaly Detection in O-RAN
O-RAN enables a disaggregated baseband stack with programmable functions that communicate over standardized open interfaces. The same openness that enables multi-vendor composition also expands the attack surface across logically decoupled tiers that make up the compute continuum. Among these threats, Denial-of-Service and performance-degradation attacks, which account for the majority of catalogued O-RAN threats, are particularly difficult to detect. Traditional Time-Series Anomaly Detection (TSAD) methods fail in this new regime where labelled baselines are scarce, threats evolve faster than detectors can be retrained, and the high-dimensional multivariate telemetry overwhelms monolithic inference models. To address these challenges, we present DAST, a zero-shot multi-agent framework for cross-interface anomaly detection in O-RAN that chains a three-stage VLM $\rightarrow$ LLM $\rightarrow$ VLM pipeline. DAST converts multivariate KPI streams into visual representations, scores textual per-interface descriptions against O-RAN domain knowledge, and verifies suspects on high-resolution heatmaps to output the problematic interfaces, the anomalous time intervals, an indicative O-RAN WG11-aligned operational impact rating and the decision rationale. We evaluate DAST on real network traces collected from an O-RAN testbed under representative performance degradation scenarios, achieving 0.910 F1-Score and 0.843 Accuracy, outperforming state-of-the-art TSAD baselines.
comment: 7 pages, 5 figures. This work has been submitted to the IEEE for possible publication
A Swarm Approach to Public Transit Using On-demand Routing in a Slime-Mold-Inspired Framework
Demand-responsive transit (DRT) is a flexible alternative to traditional, fixed-route mass-transit networks. Although DRT can function well in low-density communities, high operating costs and low reliability are common issues. We propose that these issues can be mitigated by moving from a centralized, manually-scheduled scheme to a distributed system capable of dynamically routing multiple vehicles using a slime-mold-inspired routing algorithm to maximize network effectiveness. We additionally introduce the method of dynamic transfers to further optimize transit network efficiency. All passenger allocation and dynamic transfers are handled via a continual cooperative bidding process by the buses. In this paper, we present simulated results for a swarm-driven transit network in suburban, urban, and semi-rural scenarios, using map networks pulled from OpenStreetMap. We show that our approach increases passenger delivery rates relative to a fixed-network approach by 28%, 49%, and 101%, respectively, and results in over 75% reduction in walking time in all cases.
Learning to Contest: Decentralized Robust Fairness in Cooperative MARL via Cross-Attention
Fair cooperative multi-agent RL (MARL) teams maximizing egalitarian welfare are exploitable: a single selfish agent free-rides on the surplus fair agents forgo to raise the worst-off. A centralized need-based allocator removes it, but only by taking allocation out of agents' hands; whether decentralized policies can be robust was left open. We show this futility is an artifact of all-or-nothing contention. Under graded contention (a contested resource delivers $1-c$, wasting $c$), we prove that for any $c<1$ a worst-off cooperator that contests a free-rider strictly improves on yielding, so decentralized leverage exists (Prop. 1). Realizing it is a coordination problem under uncertainty: the number of free-riders is unknown and variable, so any fixed rule is dominated. We introduce CAN, a permutation-equivariant cross-attention policy over agents' observed behaviour that infers the number of free-riders and responds proportionally: turn-taking when none, contesting just enough when some. Trained against an adversarial league (PSRO), CAN keeps best-response exploitability low ($ρ\approx1.2$-$1.5$, vs. $ρ=N$ unprotected) across the contention range, wasting almost nothing at $D=0$ (efficiency $\approx1.0$) and retaining most of it at $D\geq1$ (efficiency 0.83-0.96), approaching the centralized oracle on both axes, no central allocator. Fair-MARL learners fail on complementary axes (GGF/FEN yield and are exploitable, SOTO all-contests and wastes), while CAN is both. On two further games we find clear scope, not blanket generality: CAN stays efficient and Pareto-dominates the fair learners, but its robustness holds only in proportion to the contest leverage: strong on a multi-server game, partial when it weakens, absent under winner-take-all (Prop. 1 fails). We also report its fragilities: weak leverage and zero-shot transfer to larger teams degrade it at high contention.
comment: 9 pages, 8 figures
Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies
In this work, we propose a framework that combines multi-agent reinforcement learning (MARL) with model-based control to achieve safe, dynamically feasible actions in cooperative multi-agent tasks. Multi-agent reinforcement learning provides the advantage of learning cooperative policies for multi-agent teams from discrete non-differentiable rewards in a long planning horizon. Model-predictive control is robust and offers safe, dynamically feasible actions in a fast replanning framework for short horizons. We propose an algorithm that extends actor-critic model predictive control for MARL which we refer to as multi-agent actor-critic model predictive control (MA-AC-MPC). We demonstrate the capabilities of this algorithm by applying it to a multi-agent pursuit-evasion scenario. Specifically, we compare the evader team's strategy using the MA-AC-MPC model and a multi-layer perceptron model (MA-AC-MLP). The pursuer team uses augmented proportional navigation as it is accepted as an advanced adversarial control law. We also provide an example with a heterogeneous environment where a drone and omni-wheeled rover cooperate to achieve repeatable and successful landing with 100% success rate in hardware for MA-AC-MPC compared to 60% for MA-AC-MLP. We demonstrate the robustness of the proposed MA-AC-MPC algorithm in hardware for both environments.
comment: 12 pages, 8 figures, 7 tables
ZERO-APT: A Closed-Loop Adversarial Framework for LLM-Driven Automated Penetration Testing under Intelligent Defense
LLM-driven automated penetration testing agents are typically evaluated against static targets that neither detect nor respond to attacks, so their behavior under intelligent defense remains untested. The causal consistency of multi-step attack chains likewise hinges on unstable LLM reasoning, and agent decisions remain opaque to human analysts. These three shortcomings, in realism, consistency, and auditability, are usually patched in isolation. We present ZERO-APT, a turn-based attacker-defender-judge framework that addresses them within a single architecture. For realism, ZERO-APT embeds a configurable LLM Defender that consumes Sysmon telemetry and detects attacks in real time, exposing the attacker to a live opponent rather than a passive target. For consistency, three architectural mechanisms move causal consistency from unstable LLM reasoning into enforced system architecture: separation of planning from execution, multi-dimensional ReAct feedback, and a hard-constraint-filtered action library. For auditability, a dedicated Judge agent adjudicates each round, maintains global state, and emits structured post-hoc CTI reports that make every decision traceable. We evaluate a Windows Server 2022 post-exploitation prototype across five scenarios with three Defender configurations. ZERO-APT reaches 79\% attack success rate (Aurora 22\%, PentestGPT 39\%), a Causal Consistency Score of 0.860 (Aurora 0.930, Claude Code 0.520), and end-to-end decision auditability through structured CTI reports. We release the benchmark to support evaluation of penetration agents under intelligent defense.
MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring
We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric-aligned exemplar retrieval, enabling calibration through comparison with scored examples. Our results show that MADRAG significantly outperforms prompt-based baselines while approaching the performance of supervised systems without requiring task-specific training. Ablation studies demonstrate that retrieval drives calibration gains, while debate improves reasoning on higher-level traits. Our findings highlight the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.
comment: 21 pages, 7 figures, 14 tables
Learn to Match: Two-Sided Matching with Temporally Extended Feedback
Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback, that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement-learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information-friction loss, revealing that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms.
Comparing Sentiment Contagion in AI-Agent and Human Social Networks: Evidence from MOLTBOOK
AI agents are beginning to interact not only with people, but also with one another. We investigate what happens to sentiment in such an AI-only social network: does negativity spread, or do replies calm it down? We study MOLTBOOK, a social network made up of autonomous language-model agents, using almost 2.9 million posts and 1.5 million comments. Negative posts receive many more replies than neutral or positive posts, so negativity still attracts attention. However, replies to negative content usually do not stay negative. They most often become neutral, and there is meager evidence that negative sentiment spreads across days. The main pattern is therefore not a cycle of negativity, but negative attention followed by neutralisation. These findings suggest that AI-agent networks may behave differently from human social networks: they may dampen emotional extremes, while still depending strongly on how interactions are organised.
comment: 8 pages without appendix
Systematic LLM Translation of Legacy Scientific Code to Differentiable Frameworks: Application to a Land Surface Model
Differentiable programming offers transformative capabilities for scientific modeling, enabling gradient-based parameter estimation, sensitivity analysis, and data assimilation. Yet, migrating legacy codebases into differentiable frameworks remains a challenge. We present a five-phase LLM-based agentic pipeline that translates legacy Fortran into JAX: static dependency analysis determines module translation order from the full call graph; iterative compile-repair loops correct errors autonomously; and a Fortran reference oracle enforces numerical parity at the module level before integration and gradient verification. We instantiate and evaluate the pipeline on CLM-ml-v2, a 19,000-line Fortran land surface model, and analyze agent behavior across 73 module translation tasks. The resulting differentiable model computes the complete Jacobian in a single backward pass, recovers physical parameters in eight times fewer steps than gradient-free optimization, and achieves a 24 times wall-clock speedup over sequential Fortran at ensemble size N=2,048. Both the translated model and pipeline infrastructure are released as a reusable framework for differentiating other Earth system model components.
Detecting Perspective Shifts in Multi-agent Systems
Generative models augmented with external tools and update mechanisms (or \textit{agents}) have demonstrated capabilities beyond intelligent prompting of base models. As agent use proliferates, dynamic multi-agent systems have naturally emerged. Recent work has investigated the theoretical and empirical properties of low-dimensional representations of agents based on query responses at a single time point. This paper introduces the Temporal Data Kernel Perspective Space (TDKPS), which jointly embeds agents across time, and proposes several novel hypothesis tests for detecting behavioral change at the agent- and group-level in black-box multi-agent systems. We characterize the empirical properties of our proposed tests, including their sensitivity to key hyperparameters, in simulations motivated by a multi-agent system of evolving digital personas. Finally, we demonstrate via natural experiment that our proposed tests detect changes that correlate sensitively, specifically, and significantly with a real exogenous event. As far as we are aware, TDKPS is the first principled framework for monitoring behavioral dynamics in black-box multi-agent systems -- a critical capability as generative agent deployment continues to scale.
When Does Multi-Agent Collaboration Help? An Entropy Perspective
Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of \textit{entropy}, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies, six reasoning benchmarks, and two agentic tasks. By analyzing 245 features spanning token-, agent-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3\% of cases, and that entropy dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) \textit{Certainty Preference}: peak entropy directly harms and stable entropy directly benefits MAS correctness; 2) \textit{Base Entropy}: base models with lower entropy during problem-solving causally drive MAS performance; and 3) \textit{Task Awareness}: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the \textit{Entropy Judger}, to select solutions from MAS's pass@$k$ results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at \href{https://github.com/AgenticFinLab/multiagent-entropy}{this https URL}.
comment: Project page: https://multiagent-entropy.github.io/
Chance-Constrained Correlated Equilibria for Robust Noncooperative Coordination
Correlated equilibria enable a coordinator to influence the self-interested agents by recommending actions that no player has an incentive to deviate from. However, the effectiveness of this mechanism relies on accurate knowledge of the agents' cost structures. When cost parameters are uncertain, the recommended actions may no longer be incentive compatible, allowing agents to benefit from deviating from them. We study a chance-constrained correlated equilibrium problem formulation that accounts for uncertainty in agents' costs and guarantees incentive compatibility with a prescribed confidence level. We derive sensitivity results that quantify how uncertainty in individual incentive constraints affects the expected coordination outcome. In particular, the analysis characterizes the value of information by relating the marginal benefit of reducing uncertainty to the dual sensitivities of the incentive constraints, providing guidance on which sources of uncertainty should be prioritized for information acquisition. The results further reveal that increasing the confidence level is not always beneficial and can introduce a tradeoff between robustness and system efficiency. Numerical experiments demonstrate this tradeoff: CC-CE reduces realized coordination cost by up to 35% at intermediate confidence levels, while the proposed information-gain metric consistently identifies effective uncertainty sources to reduce.
Channel Fracture: Architectural Blind Spots in Scheduled Cross-Agent Memory Injection for Multi-Agent Orchestration Systems
Multi-agent AI orchestration systems increasingly rely on persistent memory to maintain context across sessions, agents, and tasks. When one agent must inject knowledge into another agent's memory -- a common requirement in hierarchical team architectures -- the delivery mechanism must be architecturally sound. We report the discovery of a systematic failure mode we term channel fracture: a condition where scheduled (cron) agents in orchestration frameworks are silently unable to write to the target agent's persistent memory due to hardcoded memory isolation guards. Through experiments on a production Hermes Agent deployment with five specialized profiles, we tested three injection channels: (A) direct SQLite database writes, (B) target-agent self-writes via memory tools, and (C) cron-delegated writes. Channel C failed completely due to two architectural constraints: skip_memory=True hardcoded at the scheduler layer and dynamic registration of memory tools contingent on _memory_manager initialization, which is bypassed in cron execution contexts. We propose CADVP (Cross-Agent Delivery Verification Protocol) v1.1, a 13-dimension verification framework with a veto-level channel confirmation check (CC-0) that prevents false-positive delivery assurance. We articulate two design principles: the inverse verification principle and the channel matching principle.
comment: 24 pages, 5 figures, v2 incorporates expanded controlled experiments (525 simulation 90 real runs), architecture visualizations, and the Three-Gate verification system
Benchmarking Emergent Coordination in Large-Scale LLM Populations: An Evaluation Framework on the MoltBook Archive
As multi-agent Large Language Model (LLM) systems scale, evaluating their emergent coordination dynamics becomes increasingly critical. However, current evaluation paradigms-focused on single agents or small, explicitly structured groups-fail to capture the self-organization and viral information dynamics that arise in large, decentralized populations. We introduce a systematic evaluation framework to benchmark role specialization, information diffusion, and cooperative task resolution in open agent environments. We demonstrate this framework on the MoltBook Observatory Archive, a dataset of 2.73M interactions among 90,704 autonomous agents, establishing quantitative baselines for emergent coordination. Our evaluation reveals a pronounced core-periphery structure (silhouette 0.91), heavy-tailed cascade distributions ($α= 2.57$), and severe coordination overhead in decentralized task resolution (Cohen's $d = -0.88$ against a single-agent baseline). By providing standardized evaluation tasks and empirical baselines, our framework enables the rigorous comparison of future multi-agent protocols and establishes evaluation itself as an object of scientific study.
comment: Substantial Revision Required
Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems
Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud-validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre-registered exact-winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near-best routing claim is strongly supported. In every pre-registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality-score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre-registered Kendall's W test finds no reliable difference between Vietnamese-domain and English-domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed-rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner-selection law.
comment: 13 pages, 4 appendix. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-1
Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning ICML 2026
Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.
comment: Accepted by ICML 2026 Regular Track
AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents
In this paper, we present AgentDisCo, a novel Disentangled and Collaborative agentic architecture that formulates deep research as an adversarial optimization problem between information exploration and exploitation. Unlike existing approaches that conflate these two processes into a single module, AgentDisCo employs a critic agent to evaluate generated outlines and refine search queries, and a generator agent to retrieve updated results and revise outlines accordingly. The iteratively refined outline is then passed to a downstream report writer that synthesizes a comprehensive research report. The overall workflow supports both handcrafted and automatically discovered design strategies via a meta-optimization harness, in which the generator agent is repurposed as a scoring agent to evaluate critic outputs and generate quality signals. Powerful code-generation agents (e.g., Claude-Code, Codex) systematically explore agent configurations and construct a policy bank, a structured repository of reusable design strategies, enabling the framework to self-refine without extensive human intervention. We evaluate AgentDisCo on three established deep research benchmarks (DeepResearchBench, DeepConsult, DeepResearchGym) using Gemini-2.5-Pro, achieving performance comparable to or surpassing leading closed-source systems. Observing that existing benchmarks inadequately reflect real-world user needs, we introduce GALA (General AI Life Assistants), a benchmark that mines latent research interests from users' historical browsing behavior. We further develop a rendering agent that converts research reports into visually rich poster presentations, and demonstrate an end-to-end product, AutoResearch Your Interest, which delivers personalized deep research recommendations derived from individual browsing histories.
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents ICML 2026
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), a benchmark for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25% of tasks on average (13% for GPT-5 to 43% for DeepSeek-R1), with small interface or contextual changes often doubling success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.
comment: ICML 2026
High entropy leads to symmetry-equivariant policies in Dec-POMDPs
We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP. In particular, policies coming from different initializations will be fully compatible, in that their cross-play returns are equal to their self-play returns. Through extensive evaluation of independent PPO, arguably the standard baseline deep multi-agent policy gradient algorithm, in the Hanabi, Overcooked and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the decrease in self-play returns coming from increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi in particular we achieve a new SOTA in inter-seed cross-play this way. While we give examples of Dec-POMDPs in which one cannot learn the optimal symmetry-equivariant policy this way, both our theoretical and empirical results suggest that one should consider far higher entropy coefficients during hyperparameter sweeps in Dec-POMDPs than is typically done. Code for our experiments can be found at https://github.com/jforkel/JAX-OBL
More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration ICML 2026
Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation fails. Many real-world coordination problems are not social dilemmas: helping others -- sharing documentation, unblocking a teammate -- costs the helper almost nothing while producing substantial collective benefit. Whether LLM agents cooperate in this regime, where helping is free and they are explicitly instructed to do so, remains unknown. We build a turn-based multi-agent environment that strips away all strategic complexity, making cooperation costless and trivially optimal. Across eight widely used LLMs, capability does not predict cooperation: OpenAI o3 reaches only 17% of optimal collective performance while the weaker o3-mini reaches 50%, despite identical instructions to maximize group revenue. Using a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, and find that several capable models actively withhold information despite gaining nothing from withholding. Targeted interventions address each mode: explicit protocols roughly double the performance of competence-limited models, while small sharing incentives unlock cooperation-limited ones. Our results suggest that scaling intelligence alone will not solve coordination in multi-agent systems, and will require deliberate cooperative design, even when helping costs nothing.
comment: Accepted to the ICML 2026 main conference
Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports
Issues faced when using software are reported in the form of bug reports. However, many bug reports are invalid, meaning they do not require code changes, and are resolved with a no-code fix. Manually determining the root cause of the invalid bug reports and providing actionable resolutions by the customer support causes a serious waste of resources. Our goal is to introduce a standardized taxonomy for root-cause oriented invalid bug report subclassification, and perform experiments to test the accuracy of various approaches on invalid subclassification and no-code fix generation. We study how different configurations perform on a gold-standard benchmark we have created. Using a manually curated benchmark for higher quality analysis, we experimented with vanilla LLMs, Retrieval Augmented Generation, and agentic web search to identify invalid subclasses and generate no-code fixes. We evaluated the results against manually labeled ground truth data that includes the invalid subclass and no-code fixes from the original bug reports. We measured subclass detection performance with weighted F1-Score, and assessed no-code fix suggestions using BERTScore and Judge LLM success rates. For subclassification, retrieval augmented generation achieves the highest overall performance with 0.66 weighted F1, slightly outperforming vanilla LLMs at 0.65 and agentic web search at 0.64. At the subclass level, performance peaks at 0.85 F1 for Non-reproducibility and 0.79 for Feature Request and Question, while Wrong Version remains the most challenging with scores between 0.00 and 0.29. For no-code fix generation, agentic web search achieves the highest overall Judge LLM success rate at 68.9%, compared to 64.4% for RAG applications and 64.9% for vanilla LLMs, with subclass-level peaks of 87.4% for Working as Designed and 72.2% for Question.
comment: Submitted to IEEE Transactions on Software Engineering (TSE) and currently under review
Beyond the Black Box: Interpretability of Agentic AI Tool Use
AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are difficult to diagnose and control. Agents may skip required tool calls, invoke tools unnecessarily, or take actions whose consequence becomes visible only after execution. Existing observability methods are external: prompts reveal correlations, evaluations score outputs, and logs arrive only after the model has already acted. In long-horizon settings, these failures are costly because an early tool mistake can alter the rest of the trajectory, increase token consumption, and create downstream safety and security risk. We introduce a mechanistic-interpretability toolkit built on Sparse Autoencoders (SAEs), which decompose activations into sparse internal features, and linear probes, lightweight classifiers that read signals from those features. The framework reads model states before each action and infers whether a tool is needed and how risky the next tool action is. It identifies the model layers and features most associated with tool decisions and tests their functional importance through feature ablation. We train the probes on multi-step trajectories from the NVIDIA Nemotron function-calling dataset and apply the same workflow to GPT-OSS 20B and Gemma 3 27B models. The goal is not to replace external evaluation, but to add a missing layer: visibility into what the model signaled internally before action. This helps surface deeper causes of agent failure, especially in long-horizon runs where an early mistake can impact subsequent agent behavior. More broadly, the paper shows how mechanistic interpretability can support internal observability for monitoring tool calls and risk in agent systems.
comment: 12 pages, 4 figures, 17 tables
Harmonia: End-to-End RAG Serving Optimization
Retrieval-Augmented Generation (RAG) improves the reliability of large language models by integrating external knowledge, but serving RAG pipelines efficiently is challenging because requests traverse heterogeneous components spanning LLM inference, databases, and CPU-side processing. We present Harmonia, an end-to-end RAG serving framework that addresses these bottlenecks through (i) a flexible pipeline specification interface for composing custom workflows, (ii) heterogeneity-aware deployment that provisions and configures components as a distributed inference system, and (iii) a closed-loop runtime controller that monitors load and execution progress and reduces SLO violations through request prioritization and auto-scaling. Across four RAG applications, Harmonia outperforms commercial alternatives, improving throughput by more than 2.04x while reducing SLO violations by up to 78.4 percent.
Systems and Control (EESS)
Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss
Many modern applications of deep learning involve training a neural network via a one-step prediction loss (e.g., $L^2$ regression, cross-entropy), but deploy the network by rolling out along its own predictions. Key examples include autoregressive language modeling, flow-based generative modeling, and robot policy learning. It is well-documented that these settings induce a phenomenon we call test-time feedback (TTF): the mismatch between the training/validation loss and downstream metrics of interest, such as task success rate and generation quality, which grows with task length. While data curation, architecture, and objective design have been proposed to combat train-test shift in TTF settings, this paper proposes optimization as a new design axis to mitigate error accumulation. Specifically, we introduce a new optimization paradigm called double-preconditioning (DoPr) uniquely tailored to the challenges of TTF. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), such as in KFAC. We show that the addition of AP yields a drop-in intervention for increasing downstream model performance across a range of TTF settings. Interestingly, these gains in test-time performance do not consistently accompany improvements in validation loss, opening new questions about how to properly evaluate models trained with one-step supervised objectives.
Expected String Stability of Human-Led Vehicle Platoons under Stochastic Communication Delays (Full Version)
This paper studies expected $\mathcal{L}_2$ string stability of event-triggered vehicle platoons in which a human driver leads a chain of cooperatively controlled autonomous followers under stochastic communication delays. The leader's driving behavior propagates through the string via vehicle-to-vehicle (V2V) communication, so human-induced disturbances must not amplify along the platoon. Unlike deterministic approaches based on worst-case delay bounds, we derive string-stability conditions depending on the full delay distribution through integral inequalities. The closed-loop platoon is modeled as a stochastic hybrid system capturing vehicle dynamics, communication events, and event-triggering. This framework certifies string stability even when delays exceed deterministic admissible bounds with nonzero probability. Results are evaluated under several delay distributions using the MATLAB HyEQ simulator.
comment: 7 pages, 5 figures, submitted to CPHS 2026
Impact of RTK Augmentation and INS Integration on GNSS Positioning Accuracy and Continuity: A Benchmarking Study on Inland Waterways
RTK augmentation andINS integration are widely used to improve GNSS positioning performance. However, on inland waterways, bridges and surrounding structures can degrade satellite visibility and correction availability, causing RTK augmentation loss, and GNSS/INS fusion transients. Since these effects depend on the local environment and sensor configuration, nominal receiver specifications are insufficient, and deployment-specific characterization is required. This paper presents a benchmarking study of an AsteRx-i3 D Pro+ GNSS/INS receiver installed within the mobile Sensor Box developed at KU Leuven. The study combines a real-world bridge-passage case study, static benchmarking, and closed-loop path-following experiments. The static benchmarking evaluates four receiver configurations: standalone GNSS, standalone GNSS with INS integration, RTK-augmented GNSS, and RTK-augmented GNSS with INS integration. The closed-loop experiments use INS-integrated GNSS as the navigation input and compare path-following operational performance with and without RTK augmentation. Results show that correction loss during bridge passage causes reduced positioning accuracy, increased positioning uncertainty and recovery-induced state jumps exceeding 1 m. Static benchmarking and closed-loop experiments confirm that RTK augmentation substantially improves positioning precision and uncertainty consistency, while INS integration supports short-term continuity during RTK unavailability but may introduce drift, bias, or transient uncertainty variations. By characterizing the deployment-specific receiver behavior with RTK augmentation and INS integration, this study motivates higher-level state estimation as a necessary next step toward spatially continuous and uncertainty-consistent positioning on inland waterway. The experimental data are released at: https://doi.org/10.5281/zenodo.20541733.
comment: 8 pages. 6 figures. Accepted to The 10th IEEE Conference on Control Technology and Applications (CCTA) 2026
Attack Detection using Time Series Foundation Models
This paper addresses the problem of attack detection in cyber-physical systems without any knowledge of the plant model or its structure. A remotely located plant transmits sensor measurements to an operator over a network that is assumed to be under attack. We consider two classes of attacks: model-free replay attacks and model-based stealthy attacks. For the latter, we derive closed-form expressions for the optimal stealthy attack policy against a $χ^2$ detector, for both linear and nonlinear systems. We then propose a model-structure-free detector based on TimesFM, a time-series foundation model developed by Google Research, which serves as a surrogate residual generator operating in a zero-shot fashion. We show empirically that the TimesFM-based detector achieves a comparable or superior attack detection performance. The efficacy of the proposed approach is demonstrated numerically on the IEEE 14-bus power system. We also demonstrate that TimesFM predictions can serve as a substitute for corrupted measurements, a practical mitigation technique when classical redundancy assumptions fail.
comment: Under review
From data to decisions: Bayesian modelling and global sensitivity analysis for flotation control
This work presents a data-driven framework for interpretable modelling and decision support in flotation systems, integrating Gaussian Process (GP) regression with Global Sensitivity Analysis (GSA) via Sobol indices and local interpretability using SHapley Additive exPlanations (SHAP). Based on laboratory-scale experimental data, a static GP surrogate model is developed to capture how superficial air velocity, overflowing froth velocity, froth height over the lip, pulp height, bubble size, and tailings flowrate influence the measured air recovery. The trained GP enables the computation of Sobol indices to quantify the contribution of each variable and their interactions to the overall variance in air recovery. The combination of Bayesian inference and Sobol-based sensitivity metrics provides a systematic approach to identify the dominant and interacting variables governing air recovery. This study links Bayesian learning, sensitivity quantification, and explainability to provide a foundation for data-driven control and optimisation of flotation processes.
Voltage Unbalance-Aware AC Optimal Power Flow in Distribution Networks
The increasing penetration of single-phase loads and distributed generation exacerbates voltage unbalance (VU) in distribution grids, raising concerns about power quality and complicating network operation. However, most market-clearing models and price-based coordination frameworks do not enforce VU limits within a three-phase AC representation, so the implications for grid-code compliance, numerical scalability, and economic signals remain unclear. This paper embeds VU in a three-phase AC optimal power flow market-clearing model and benchmarks two treatments: strict VU limit enforcement and objective function penalization. Building on these insights, an Improved Hybrid Limits (IHL) formulation is proposed that preserves compliance while using a smooth unbalance proxy in the objective to guide the optimization solver. Case studies on a European low-voltage feeder show that IHL maintains feasible operating points, yields price and curtailment signals consistent with conventional hybrid formulations, and converges substantially faster and more reliably than a penalization based on the exact unbalance metric. These results support IHL as a practical and scalable mechanism for VU mitigation in market-based operation of unbalanced distribution systems.
Gotta Grow Fast: Design and Benchmarking of a Tip Mount for High-Speed Vine Robots
Soft, growing vine robots extend through tip eversion, a mechanism that enables navigation through cluttered environments. However, integrating cameras and other sensors at the tip is uniquely challenging because the material forming the tip is constantly renewed as the robot grows. This continual material turnover, combined with friction between internal layers, added tip weight, and fabric constriction, complicates sensor and tool mounting. These limitations hinder the deployment of vine robots for inspection and search tasks, where rapid growth while carrying tip-mounted sensors is essential. In this work, we present a triangular roller tip mount that reduces internal resistance during growth by rolling rather than sliding against the robot body. The design was refined through iterative failure analysis, enabling, for the first time, consistent eversion on a TPU-coated ripstop nylon vine robot. To quantitatively evaluate mount performance, we introduce a custom testbed that isolates tip mounting effects by measuring tail tension during eversion. Comparative experiments across multiple mount variants, including prior designs, show that our triangular roller mount achieves the lowest tail tension and most repeatable growth performance. These results establish both a validated tip mount design and a repeatable benchmarking framework for advancing sensor and tool integration in soft growing robots. CAD for the mount and testbed is available at: https://sprout-mitll.github.io/tip_mounts/.
comment: Accepted to IEEE Robotics & Automation Letters
Double-Directional Wireless Channel Modeling Using Statistics-Aided Machine Learning
The double-directional (DD) wireless channel model is important for realistic system design since it provides complete propagation information. While stochastic and deterministic channel models are widely adopted, and existing machine learning (ML) solutions mostly aim to align future channel realizations, these solutions are often limited to short time spans that may not be statistically significant. Moreover, because the number of multi-path components (MPCs) varies with spatial and temporal variation of the receiver (RX) and/or interacting objects (IOs), typical ML solutions that require fixed, predefined input and output shapes fall short. To curb these limitations, we propose a statistics-aided ML solution that relies on a fixed subset of MPCs selection. More specifically, we first select top-$M$ MPCs, where $M\in\mathbb{Z}^+$ is much smaller than the total number of MPCs, and construct learnable graphs to train our proposed hybrid TimesNet-TimeFilter (TNTF) model. We then use a channel statistics-aided training method to generate future top-M DD channel realizations such that the statistics calculated from these realizations matches closely with those of the actual statistics from the complete time-varying DD channel realizations. We validate the proposed solution using extensive simulations on both synthetic stochastic channel model (SCM)-based and deterministic ray-tracing-based datasets, and demonstrate its effectiveness relative to state-of-the-art baselines.
Mixed Potential Approach to Convergence of Nonlinear RLC Circuits with Memristors
The paper considers a large class of nonlinear circuits, termed RLCM, containing all four basic circuit elements, i.e., resistors, inductors, capacitors and memristors. A companion paper [1] has introduced a mixed potential for RLCM circuits generalizing that found by Brayton and Moser for circuits without memristors. In this paper, systematic Lyapunov-like results on convergence of RLCM circuits are proved by means of the mixed potential. These hold under the basic assumption that an RLCM circuit has a complete set of variables in the flux-charge domain and they require, roughly speaking, that there is a balance, which is quantitatively estimated, between capacitors and inductors. The convergence results are robust with respect to circuit parameter variations and they include cases where the memristor circuits possess multiple stable equilibrium points, which is of importance for instance to implement content addressable memories (CAMs). The results extend to circuits possessing all four basic circuit elements previous results that pertain to circuits without memristors or memristor circuits without inductors. The main proofs are conducted by using the flux-charge analysis method (FCAM) to analyze RLCM circuits in the flux-charge domain.
Amortized Nonlinear Model Predictive Control
Nonlinear Model Predictive Control requires solving a constrained nonlinear program (NLP) in real-time at every sampling instant, a computational bottleneck that limits deployment on resource-constrained hardware or at high sampling rates. We address this challenge for the broad class of input-affine nonlinear systems to show that the optimal control move can be approximated by a state-dependent quadratic program (QP) whose cost parameters depend on the current state and reference. We propose a single-network residual-corrector architecture: a state-dependent analytic baseline provides initial QP parameters, and the network learns only the corrections needed to match the full NLP solution; the QP is solved by a differentiable interior-point layer, guaranteeing constraint satisfaction for the first control action. The network is trained offline on data generated by an NLP solver using a hybrid loss that combines supervised imitation and KKT-residual penalties. We validate the approach on a three-link planar robotic arm with Cartesian end-effector tracking, demonstrating orders-of-magnitude speedup over the NLP solver while maintaining comparable tracking performance.
comment: 6 pages
On Leadership Emergence in Opinion Dynamics on Social Networks
Leadership in social groups emerges dynamically through interaction and opinion exchange. Empirical evidence indicates that individuals expressing strong opinions tend to gain influence, while sustained leadership critically depends on maintaining alignment with the surrounding social context. Motivated by these observations, we introduce a coupled dynamical model describing the simultaneous evolution of opinions and leadership in a networked population. Extending the Friedkin-Johnsen framework, we represent leadership as a time-varying susceptibility to social influence, which evolves according to a game-theoretic mechanism, consistent with social psychology evidence. Within this setting, agents strengthen their leadership by expressing decisive yet socially coherent opinions, whereas misalignment with the collective state results in a loss of influence. We analyze the coupled dynamics and establish sufficient conditions to identify which agents necessarily emerge as leaders and which act as followers in the social network.
comment: 9 pages, 4 figures
Efficient Multi-Agent Optimization of Optical Power in S+C+L-Band Systems
We propose an AI Agent tailored for link power management in multi-band systems. In S+C+L band span-level study, the agent efficiently solves various optimization objectives. In network-wide evaluation, it delivers 689.0 Tbps gain in total allocated traffic with merely 303 average interactions per power profile.
Physics-Informed Graph Learning Acceleration for Large-Scale AC-OPF with Topology Changes
In power systems, alternating current optimal power flow (AC-OPF) has been a challenging problem for decades due to its nonconvexity, but fast and efficient solutions are even more needed because of high penetration of large scale renewable generation and load growth. Recently, neural networks (NN) have gained attention in solving AC-OPF, but it is still in an early stage to be applicable for real and large-scale power system operation with topology-changing characteristics. To end this, we propose a novel framework called GraphOPF that considers topology-adaptability, scalability, NN training time, self-supervision, and feasibility altogether. Extensive experiments show that the proposed framework against the baselines is up to 200 times faster in NN training and up to 66 times faster in solving AC-OPF for large-scale power systems including the real Korean power system, while achieving more than 99% feasibility.
Tracking Control for a Dynamic Model of an Underwater Submersible
Underwater vehicles are naturally modelled as rigid bodies on SE(3) subjected to added mass effects. The passivity of the Hamiltonian structure of the system can be exploited to design energy-based stabilising controllers, however, the extension of these control designs to tracking control is not trivial since the error system for the classical error formulations is not itself Hamiltonian. In this paper, we show that a novel choice of error function leads to error dynamics that are Hamiltonian. We go on to derive an energy-based tracking control for a fully coupled model of a submersible vehicle. Asymptotic convergence of the control scheme is proved and the control is demonstrated in a simulation study of the Blue Robotics BlueROV2 Heavy submersible.
Accelerating and Scaling MPC-Guided Reinforcement Learning for Humanoid Locomotion and Manipulation
In humanoid motion control, model predictive control (MPC) offers physically grounded prediction and constraint handling, while reinforcement learning (RL) enables robust whole-body skills through large-scale simulation. However, using MPC inside RL often requires time-consuming problem construction or excessive training overhead, making such frameworks difficult to justify in practice. This work studies efficient training-time MPC guidance for humanoid locomotion and manipulation, termed MPC-RL. We introduce a centroidal-dynamics MPC reward formulation that leverages guidance from MPC trajectories in training time. To make this practical in massively parallel RL, we develop $π^n$MPC, a parallel-in-horizon and construction-free batched GPU MPC solver that operates directly on time-varying dynamics to avoid high memory usage and pre-compilation. Through a variety of comparative studies and hardware validations, we have found that MPC-RL achieves superior performance in locomotion and manipulation skills. The code base is available at https://github.com/junhengl/mpc-rl.
comment: 8 pages, 5 figures
Dynamic Multi-Agent Pickup and Delivery in Robotic Cellular Warehousing Systems
Robotic Cellular Warehousing Systems (RCWS) give rise to multi-agent pickup and delivery (MAPD) processes in which robots sequentially collect multiple stock-keeping units (SKUs) for each order. Unlike classical MAPD formulations that assume static tasks, real warehouse operations often involve dynamic order evolution, where new SKUs may be appended to an order while it is being executed. Motivated by this practical requirement, this letter formulates the Dynamic Multi-Agent Pickup and Delivery problem considering internal order evolution for the first time. Building on the token passing paradigm, we propose two event-triggered online replanning algorithms. The first, Dynamic Token Passing, performs localized replanning upon order updates through add-order decomposition and priority-based token scheduling while preserving collision-free execution. The second, Cooperative Token Passing, further enables idle robots to opportunistically assist newly added pickups, improving system-level efficiency. Simulation results in RCWS environments demonstrate that the proposed methods significantly reduce order flowtime compared with static and non-cooperative baselines.
Development of a Structured Approach for Establishing Mission Engineering Requirements
This paper addresses the question: How can mission effectiveness be systematically defined or approximated in the absence of customer requirements? Legacy requirements engineering frameworks presuppose customer input to define specifications but leave a gap in the process when stakeholder input is ill-defined or missing. Rapid build and development programs (such as military acquisition, space assets, infrastructure projects, etc.) often see requirement and objective evolutions throughout the proposal process, so a more adaptive method is needed. To address this gap, a structured approach is proposed that decomposes mission intent into mission context, functions, constraints, critical dimensions, effectiveness attributes, and architecture alternatives. This method conducts a mission feasibility assessment, prioritizes mission-critical dimensions using Best-Worst Scaling, and introduces a mission complexity factor to quantitatively understand the impacts of external mission difficulties, technology maturity, evidence and confidence standards, and mission utility. The resulting method provides a traceable basis for deriving Tier 1 and 2 requirements. The approach is structured to support future Unified Architecture Framework (UAF) and Systems Modeling Language (SysML) artifact integration. The proposed framework is demonstrated using a notional close air support mission example.
comment: 19 pages; 9 tables, 3 figures, presented at AIAA Aviation 2026
Learning-Assisted Day-Ahead Energy Scheduling for Frequency-Secure Inverter-Dominated Grids with Grid-Forming Battery Energy Storage Systems
As grid-forming (GFM) battery energy storage systems (BESS) are increasingly deployed to enhance power system inertial response and frequency stability, incorporating their frequency support capabilities into day-ahead energy scheduling (DAES) is essential for achieving both frequency security and operational efficiency. However, accurately determining frequency metrics in grids with coexisting GFM inverters and synchronous generators requires electromagnetic transient (EMT) simulations, which are computationally prohibitive for direct embedding in grid operational optimization models. To bridge the gap between modeling accuracy and computational efficiency, a learning-assisted DAES (LA-DAES) framework is proposed in this work. By leveraging a surrogate model to represent the frequency support dynamics of GFM BESS, the proposed framework ensures frequency security with a reasonable solve time. Comparative results demonstrate that, relative to analytical frequency-constrained DAES, the proposed LA-DAES framework more accurately captures grid frequency metrics and improves the utilization of GFM BESS.
Constrained Deep Reinforcement Learning for Cognitive Radar Resource Management
In this paper, multi-target tracking and scanning are considered in a radar system operating in the track-while-scan mode. Specifically, time allocation for radar scanning and tracking of multiple maneuvering targets under a time budget constraint is addressed, aiming to jointly optimize the performance of both tracking and scanning in a cognitive radar. We first present the details of the model for tracking and scanning and formulate the time management task as a constrained optimization problem. Subsequently, we design a \gls{cdrl} framework to find the time allocation strategy for the problem. In the proposed \gls{cdrl} framework, the parameters of the neural networks and the dual variable are learned simultaneously. The deep deterministic policy gradient (DDPG) algorithm is introduced to tackle continuous action space and its performance is compared with deep Q-learning, heuristic approaches, and an optimization-based approach. Numerical results show that the radar with the proposed \gls{cdrl} framework can autonomously allocate more time to the tracking task that requires greater attention while providing time for scanning and also constraining the total time budget below the predefined threshold.
IDDMBSE: Integrating Data-Driven and Model-Based Systems Engineering for Trusted Autonomous Cyber-Physical Systems
Autonomous cyber-physical systems (CPS) sit at the intersection of Model-Based Systems Engineering (MBSE) and data-driven Machine Learning and Artificial Intelligence (ML/AI), yet no integrated Systems Engineering (SE) methodology natively spans both. We address this gap with IDDMBSE, an Integrated Data-Driven and Model-Based Systems Engineering methodology that extends the rigorous MBSE V-process with a data-driven loop at every step, anchored in SysML, the autonomy stack, and a hybrid model-based plus data-driven trade-off architecture. We instantiate IDDMBSE as an interoperable, open-source tool chain: PERFECT, which maps SysML system architectures to executable ROS autonomy stacks for scalable performance evaluation; TRADES-X, which decomposes design-space exploration into a model-based optimization stage followed by a data-driven evaluation stage; and VERITAS, which combines formal, data-driven, and runtime verification into a single assurance workflow. We demonstrate IDDMBSE on a Trusted Autonomous Ground Robot across its development lifecycle, spanning sensor-suite selection, risk-sensitive path planning, behavior-tree task verification, conformal-prediction-based robust perception, and assured multi-robot coordination, all exercised in a contested-terrain Isaac Sim test range that we release with the tool chain. We close by sketching how IDDMBSE is being re-formulated on SysML v2 / KerML foundations to enable language-native composability and tighter ML/AI integration.
comment: 9 pages, 11 figures. This work has been submitted to the IEEE for possible publication
MSAIC-Net: A Multi-Scale Attention and Imbalance-Aware Contrastive Network for ECG-Based Myocardial Substrate Abnormality Detection
Myocardial substrate abnormalities, such as myocardial scar and myocardial infarction (MI), are associated with adverse cardiovascular outcomes. Electrocardiography (ECG) provides a low-cost and widely available tool for detecting these abnormalities, but ECG-based detection remains challenging due to heterogeneous lead-dependent manifestations, high-dimensional multi-lead signals, class imbalance, and the limited interpretability of deep learning models. We propose a multi-scale attention-enhanced convolutional network (MSAIC-Net) for ECG-based myocardial substrate abnormality detection. MSAIC-Net employs parallel atrous convolutional branches to extract ECG features across multiple temporal receptive fields. %, enabling the model to capture both local and longer-range temporal patterns. Channel attention is then used to adaptively reweight informative lead-wise and feature-channel representations. To address class imbalance and improve feature separability, we introduce a novel imbalance-aware supervised contrastive learning strategy that encourages samples from the same class to form compact representations while increasing separation between abnormal and normal samples. Lead-wise permutation importance is further incorporated to quantify the contribution of each ECG lead and improve model interpretability. The proposed method was evaluated on two complementary datasets: a low-data institutional cohort from the University of Virginia (UVA) Health System for myocardial scar classification and the large-scale public PTB-XL dataset from PhysioNet for MI identification. Experimental results show that MSAIC-Net outperforms baseline models, with particularly pronounced improvements in the low-data UVA cohort. Overall, the proposed framework provides an effective and interpretable approach for ECG-based detection of myocardial substrate abnormalities.
Estimating Evolving Functions with Dynamic Gaussian Processes
This paper develops the Dynamic Gaussian Process (DGP), a framework for estimating functions governed by integro-difference equations (IDEs). IDEs model continuous functions that evolve with discrete-time dynamics and arise naturally from time-discretization of linear partial differential equations (PDEs). The DGP extends Gaussian process regression to time-varying functions and extends Kalman filtering to infinite-dimensional states. The DGP posterior remains a Gaussian process with closed-form mean and covariance updates, and separable kernel structure reduces the problem to a finite-dimensional Kalman filter on basis function coefficients. This paper extends the DGP to vector-valued states, enabling the treatment of higher-order PDEs, and provides a stability and approximation error analysis for the basis function approximation. The functional L2 estimation error decomposes exactly into in-subspace and out-of-subspace contributions, and all approximation errors vanish as the number of basis functions grows. The framework is demonstrated on the heat equation and on the wave equation, the latter with a vector-valued state. Code is available at https://github.com/JvHulst/Dynamic_Gaussian_Processes.
comment: This manuscript is a preprint submitted to a SIAM journal
Towards Serverless Semi-Decentralized Federated Learning with Heterogeneous Optimizers
We investigate cluster formation, involving the number and composition of clusters, in decentralized federated learning (FL) with heterogeneous machine learning (ML) optimizers. While clustering in centralized FL has enabled scalability and resource savings, its value and development in fully decentralized environments have yet to be explored. Optimizing cluster formation in such environments is challenging, especially due to the complex coupling between network graph structures, local data heterogeneity, and different local ML model optimizers. To address these challenges, we propose serverless semi-decentralized FL (SSD-FL), a methodology requiring no persistent server infrastructure. In SSD-FL, cluster formation occurs via a lightweight, one-time device-to-device (D2D) initialization phase, after which actual ML model training (alongside consensus and convergence processes) is fully serverless. Functionally, SSD-FL segments global rounds into intra-cluster and inter-cluster regimes, ensuring global convergence and consensus through novel "effective loss functions" that integrate device-specific ML optimizers with network graph-based regularization. Next, SSD-FL leverages the consensus gap via the Cheeger inequality to develop an iterative clustering algorithm evaluated against our derived convergence and consensus bounds, which incorporate a unique scoring metric to quantify data and optimizer heterogeneity across devices. Finally, experimental evaluation against three categories of decentralized FL methodologies validate that SSD-FL improves both convergence speeds and communication efficiency across various network graphs, datasets, and local optimizer regimes.
comment: Under review at IEEE/ACM Transactions on Networking
MPC for nonlinear systems: a comparative review of discretization methods
This work provides a comparative review of three different numerical methods generally used to discretize continuous-time non-linear equations appearing in model predictive control problems: direct multiple shooting, direct collocation and successive linearizations. An overview of the characteristics of each method is given and the performance of each method is evaluated through the simulation of two test cases.
Learning to optimize with guarantees: a complete characterization of linearly convergent algorithms
The design of many classical optimization algorithms is driven by the certification of linear convergence rates over classes of optimization problems. In this paper, we consider the problem of improving the average-case performance of an algorithm over a specific distribution of problem instances. While this task can be tackled by embedding trainable components into the algorithm updates, a key challenge is to preserve worst-case guarantees across the entire problem class. For classes of composite optimization problems, we show that all linearly convergent algorithms can be parametrized in terms of a baseline linearly convergent algorithm, and a set of trainable, exponentially-decaying modifications to its update rule; crucially, this parametrization excludes all-and only-the algorithms that do not converge linearly. Our results apply to improving the average-case performance of classical algorithms such as gradient descent for nonconvex, gradient-dominated functions; Nesterov's accelerated method for smooth, strongly convex functions; and projected gradient methods for optimization over polyhedral feasible sets. We illustrate how our characterization can be used for learning to optimize with linear convergence and feasibility guarantees. Numerical results showcase benefits over classical optimizers when solving ill-conditioned systems of linear equations and running a model predictive control scheme on a linear dynamical system.
Exact and Evolutionary Algorithms for Sequential Multi-Objective Transmission Topology Planning
We study day-ahead transmission topology control for high-voltage grid operation under $N-1$ security constraints. The operational task is to select, over a 24-hour horizon, a sequence of substation topologies obtained via busbar-coupler switching to relieve line overloads while limiting switching effort and topological complexity. We formulate this task as a sequential multi-objective optimization problem with four objectives used in TSO decision making: worst-case $N-1$ line loading, maximum topological depth, number of topology changes, and time spent outside the reference topology. We propose an exact block algorithm that exploits the temporal structure of topology plans: consecutive hours with the same topology are represented as blocks, enabling enumeration of the complete Pareto front over the admissible set of topologies under fixed operational bounds on depth and switching. We also develop a tailored NSGA-III-based evolutionary heuristic and evaluate it against the exact front. Using real operational data from the Dutch high-voltage transmission grid operated by TenneT, the block algorithm computes the exact front for a highly congested day in under three minutes after topology-level load-flow preprocessing. The exact front reveals low-switching plans with no DC $N-1$ thermal overloads that the tested evolutionary search fails to find. The proposed method, therefore, provides both a practical day-ahead decision-support tool for transmission operators and a benchmark for heuristic and learning-based topology-control methods.
comment: 27 pages, 6 figures
Policy Gradient for Continuous-Time Robust Markov Decision Processes
The framework of robust Markov decision processes (RMDPs) allows the design of reinforcement learning agents that satisfy performance guarantees under worst-case transition dynamics. Traditional RMDPs consider discrete-time dynamics and recently, sample-efficient policy gradient algorithms have been considered in this context. This paper investigates policy gradient algorithms within a continuous-time RMDP framework. Policy gradients and adversarial gradients are derived using pathwise and adjoint-based formulas for stochastic and ordinary differential equations. We propose double-loop optimisers to obtain linear convergence in the oracle-based setting and an $\tilde{\mathcal{O}}(\frac{1}{ε^2})$ sample complexity in the sample-based setting in an analysis which also derives novel tools for the framework of undiscounted total cost MDPs. Additionally, we propose mean-field optimisers as distributional optimisers with an $\tilde{\mathcal{O}}(\frac{1}{K})$ oracle-based convergence rate and an $\tilde{\mathcal{O}}(\frac{N^2}ε)$ sample complexity under $N$-particle approximation. The effectiveness of continuous-time policy gradient algorithms is confirmed for both optimisers on continuous-time RMDPs with neural ordinary differential equation dynamics.
Optimal Control Synthesis of Closed-Loop Recommendation Systems over Social Networks
This paper addresses the problem of designing recommendation systems for social networks and e-commerce platforms from a control-theoretic perspective. We treat the design of recommendation systems as a state-feedback infinite-horizon optimal control problem with a performance index that (i) rewards alignment and engagement, (ii) penalizes polarization and large deviations from an uncontrolled baseline, and (iii) regularizes exposure across neighboring users. The recommendation entries are fed to the platform users, who are assumed to follow a networked, multi-topic, continuous-time opinion dynamics. We show that the designed control yields a stabilizing recommendation system under simple algebraic spectral conditions on the weights that encode the platform's preference for engagement, stability of preferences, polarization, and cross-user diversity. Conversely, we show that when ill-posed weights are selected in the optimal control problem (namely, when engagement is excessively rewarded), the closed-loop system can exhibit destabilizing, pathological behaviors that conflict with the design objectives.
Disturbance rejection control barrier functions
Most existing robust control barrier functions (CBFs) can only handle matched disturbances, restricting their applications in real-world scenarios. While some recent advances extend robust CBFs to unmatched disturbances, they heavily rely on differentiability property of disturbances, and fail to accommodate non-differentiable case for safety constraints with high relative degree.To address these limitations, this paper proposes a class of disturbance rejection CBFs (DRCBFs), including knowledge-based DRCBFs (kDRCBFs) and reciprocal-compensated DRCBFs (rDRCBFs).These two DRCBFs can strictly guarantee safety under general bounded disturbances, which includes both matched or unmatched, differentiable or non-differentiable disturbances as special cases. Moreover, no information of disturbance is needed in rDRCBFs. Simulation results illustrate that the proposed DRCBFs outperform existing robust CBFs.
comment: 8 pages, 5 figures
From Ground to Sky: Architectures, Applications, and Challenges Shaping Low-Altitude Wireless Networks
In this article, we introduce a novel low-altitude wireless network (LAWN), which is a reconfigurable, three-dimensional (3D) layered architecture. In particular, the LAWN integrates connectivity, sensing, control, and computing across aerial and terrestrial nodes that enable seamless operation in complex, dynamic, and mission-critical environments. Different from the conventional aerial communication systems, LAWN's distinctive feature is its tight integration of functional planes in which multiple functionalities continually reshape themselves to operate safely and efficiently in the low-altitude sky. With the LAWN, we discuss several enabling technologies, such as integrated sensing and communication (ISAC), semantic communication, and fully-actuated control systems. Finally, we identify potential applications and key cross-layer challenges. This article offers a comprehensive roadmap for future research and development in the low-altitude airspace.
comment: 12 pages
ScenicRules: An Autonomous Driving Benchmark with Multi-Objective Specifications and Abstract Scenarios
Developing autonomous driving systems for complex traffic environments requires balancing multiple objectives, such as avoiding collisions, obeying traffic rules, and making efficient progress. In many situations, these objectives cannot be satisfied simultaneously, and explicit priority relations naturally arise. Also, driving rules require context, so it is important to formally model the environment scenarios within which such rules apply. Existing benchmarks for evaluating autonomous vehicles lack such combinations of multi-objective prioritized rules and formal environment models. In this work, we introduce ScenicRules, a benchmark for evaluating autonomous driving systems in stochastic environments under prioritized multi-objective specifications. We first formalize a diverse set of objectives to serve as quantitative evaluation metrics. Next, we design a Hierarchical Rulebook framework that encodes multiple objectives and their priority relations in an interpretable and adaptable manner. We then construct a compact yet representative collection of scenarios spanning diverse driving contexts and near-accident situations, formally modeled in the Scenic language. Experimental results show that our formalized objectives and Hierarchical Rulebooks align well with human driving judgments and that our benchmark effectively exposes agent failures with respect to the prioritized objectives. Our benchmark can be accessed at https://github.com/BerkeleyLearnVerify/ScenicRules/.
comment: v2: Minor numerical corrections for Table V. 16 pages, 14 figures, 7 tables. Extended version of paper accepted to 2026 IEEE Intelligent Vehicles Symposium (IV 2026). ScenicRules benchmark available at https://github.com/BerkeleyLearnVerify/ScenicRules
Constricting Tubes for Prescribed-Time Safe Control
We propose a constricting Control Barrier Function (CBF) framework for prescribed-time control of control-affine systems with input constraints. Given a system starting outside a target safe set, we construct a time-varying safety tube that shrinks from a relaxed set containing the initial condition to the target set at a user-specified deadline. Any controller rendering this tube forward invariant guarantees prescribed-time recovery by construction. The constriction schedule is bounded and tunable by design, in contrast to prescribed-time methods where control effort diverges near the deadline. Feasibilityå under input constraints reduces to a single verifiable condition on the constriction rate, yielding a closed-form minimum recovery time as a function of control authority and initial violation. The framework imposes a single affine constraint per timestep regardless of state dimension, scaling to settings where grid-based reachability methods are intractable. We validate on an 18-dimensional multi-agent system, demonstrating scalability and prescribed-time recovery with bounded control effort.
comment: 8 pages, 6 figures
Modeling Coincident Peak Pricing in Electricity Markets: Challenges and Peak Shaving Effectiveness
Coincident Peak (CP) pricing is widely used in U.S. electricity markets to allocate capacity and transmission costs. This paper develops a behavioral game-theoretic framework for CP-driven load shifting that couples a nonlinear cost-allocation model with day-ahead (one-shot) and real-time (sequential-learning) decision processes. We examine two update rules, namely best-response dynamics (BRD) and fictitious-play dynamics (FPD), across continuous and finite action spaces to quantify how flexibility, action resolution, and participation influence peak outcomes. Using ERCOT peak-day data, we find that FPD reliably reduces system peaks, whereas BRD is more variable and can increase peaks under tight-capacity conditions. Finer action resolution improves peak shaving, while the number of participants is largely neutral when aggregate flexibility is fixed. Meanwhile, information-provider signals can induce herding, whereas response-aware or diverse signals improve peak shaving. These results highlight both the potential and limits of CP pricing: smoothing information and enabling granular control are as important as the amount of available flexibility. The framework offers practical guidance for system operators and consumers: For ISOs, broadcasting smoothed CP signals and setting minimum controllable-capacity thresholds enhance coordination. For consumers, greater flexibility and finer control resolution improve both cost savings and peak-shaving performance.
comment: Coincident Peak Pricing, Demand Response, Game Theory, Peak Shaving
Reformulating Energy Storage Capacity Accreditation Problem with Marginal Reliability Impact
To enhance the efficiency of capacity markets, many electricity markets in the U.S. are adopting or planning to implement marginal capacity accreditation reforms. This paper provides new insights into energy storage capacity accreditation using Marginal Reliability Impact (MRI). We reformulate the commonly used reliability-based storage dispatch model as an optimization problem, enabling direct calculation of the MRI from the Lagrange multipliers, rather than using brute-force perturbation analysis. The analysis demonstrates that the EUE is a piecewise linear function and the storage MRI retains a non-negative property across various system scenarios. We further explore the influence of qualified capacity (QC), storage dispatch rules, and other key factors on storage accreditation, providing practical insights for system operators. Additionally, comparisons of storage capacity accreditation under different reliability criteria offer valuable guidance for policymakers in setting future standards. Numerical results from a modified California system validate our findings and highlight several important phenomena associated with the MRI-based accreditation scheme.
Safety by Invariance, Liveness through Refinement: Heterogeneous Contract Framework for Co-Design of Layered Control
Real-world control systems must achieve long-horizon objectives (liveness) while respecting continuous-time safety constraints, a combination that motivates hierarchical layered control architectures (LCAs). Existing LCA research, however, lacks (i) a uniform specification language across discrete planning and continuous execution, (ii) formal guarantees that specifications are preserved when interconnecting subsystems at heterogeneous time scales, and (iii) compositional separation between layers, owing to reliance on naive input-filtering laws. This paper addresses all three gaps by importing the safety--liveness decomposition into a heterogeneous assume--guarantee framework: \emph{safety is enforced by invariance} at the continuous-time layer, while \emph{liveness is achieved through refinement} at the discrete-time layer, with inter-layer coordination formalized via vertical refinement and timing-compatibility conditions. We instantiate this contract with a novel LCA combining an MPC planner, an input-to-state stabilizing (ISS) low-level controller, and a reference-governor bridge, and validate it on a Hybrid Energy Storage System (HESS) comprising a battery and a supercapacitor.
comment: 21 pages
Robotics
Learning Contact Representation for Leg Odometry
The estimation of odometry in legged robots depends on the assumption that the velocity of the foot with respect to the world remains zero during the stance phase. Feedback for the main body velocity is derived from the kinematic serial chain of the feet making accurate leg phase detection is a critical subproblem. A considerable number of studies employ ground reaction force sensors mounted at the tip of the foot to classify, yet these sensors may not be universally available for all legged robots. Additionally, these sensors are often unresponsive to unaccounted disturbances, such as slippage, while the foot remains in contact with the ground. In this study, we propose a self-supervised representation learning framework for contact detection that utilizes the standard sensor set of joint encoders without reliance on force sensor augmentations. We employ learned representations to model the stance and swing phases probabilistically. The experimental results obtained confirm the efficacy of the proposed self-supervised contact detector. Our framework exhibited superior performance in comparison to supervised methods which necessitate sensor set augmentation and labeling, as well as baseline probabilistic approaches. Additionally, we make our code available to the public.
comment: 17 pages
Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers ICRA 2026
Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment. To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images. Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
comment: Accepted at ICRA 2026's Workshop MM-SpatialAI: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding
FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization
Post-training Vision-Language-Action (VLA) models into policies that can be reliably deployed on real robots remains a major bottleneck. SFT and DAgger exploit failure signals only indirectly, and reward-based RL is bottlenecked by the difficulty of real-world reward design and of training reliable critics. We present FlowPRO, a reward-free offline reinforced fine-tuning framework for flow-matching VLAs. Algorithmically, we propose RPRO (Robotic Flow-matching Proximalized Preference Optimization), a preference-optimization objective tailored to the flow-matching action head of VLA models. RPRO pairs a contrastive optimizer with an explicit proximal regularizer that anchors the absolute magnitude of the implicit reward, thereby eliminating the reward-hacking failure mode of plain Flow-DPO. On the data side, a teleoperated intervention-and-rollback paradigm produces naturally paired positive and negative trajectories $(τ^w, τ^l)$ on a real robot from a single operator action; a Smooth Interpolation procedure, combined with batch mixing, then converts these sparse corrections into dense per-state supervision while preserving the base policy's capabilities. On four long-horizon bimanual tasks, FlowPRO attains the highest success rate, outperforming four representative baselines, and ablations confirm the contribution of each loss component.
Uncertainty-Aware Adaptive Sensor Fusion for Autonomous Navigation
This work introduces a hybrid deep learning approach integrated with an Unscented Kalman Filter (UKF) to enhance pose estimation accuracy in Visual-Inertial Odometry (VIO) for autonomous navigation. The proposed model employs a Vision Transformer (ViT) network to effectively capture temporal dependencies from inertial measurement unit (IMU) data and utilizes a Multiscale Convolutional Neural Network (MCNN) to learn optical flow-based motion cues from visual data. An adaptive sensor fusion module dynamically weights IMU and visual features by leveraging estimated uncertainty, thus improving robustness in diverse and challenging environmental conditions. Additionally, a novel uncertainty-aware loss function is proposed to explicitly incorporate prediction uncertainty into the learning process, enabling robust and accurate navigation under noisy, incomplete, or unreliable sensor inputs. Comprehensive evaluations of the KITTI dataset demonstrate that the proposed method significantly outperforms baseline approaches, achieving superior performance in terms of Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). The lightweight and computationally efficient model processes data at 155 FPS on an NVIDIA A100 GPU, making it highly suitable for deployment in resource-constrained autonomous systems.
comment: 13 pages
Learning from Demonstrations over Riemannian Manifolds using Neural ODEs: An Extended Abstract
Learning from demonstratins (LfD) is usually performed over Euclidean spaces, while the robot state, e.g. orientation, naturally evolves over curved spaces. Therefore, to ensure natural, complex motion generation, we investigate learning from demonstrations over Riemannian manifolds that are capable of encoding both position and orientation data. Here, geodesic paths provide for natural motion between two arbitrary points within the manifold. We propose to numerically estimate geodesics via neural ordinary differential equations, mitigating large computational overhead of existing approaches. Finally, these geodesics can be decoded back into the original task space before deploying on the robot. In this extended abstract, we discuss the architecture of our framework, provide some initial insights from our simulation experiments, including comparison to other geodesic computation mechanisms, and discuss the challenges and prospects for future work.
comment: 2 pages
MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping
This work addresses sequentially grasping multiple objects with a single dexterous hand without releasing those already held. Most dexterous grasping methods commit all of the hand's degrees of freedom to a single object, underutilizing its dexterity and leaving no redundancy for subsequent grasps. The proposed solution, MoDex, is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and point cloud. The opposition space condition specifies which fingers participate in the current grasp, enabling the gripper to use only a subset of its available degrees of freedom while reserving the remaining degrees of freedom for subsequent grasps. To facilitate sim-to-real transfer, MoDex is trained in two stages: first through imitation learning on expert demonstrations, and subsequently through reinforcement learning fine-tuning, which consistently improves success rates over the pre-trained policy. We evaluate MoDex in simulation on a MuJoCo-based Franka Emika Panda robot equipped with an Allegro Hand and on the corresponding real-world hardware platform. Across both simulation and real-world experiments, MoDex achieves higher success rates than the evaluated learning-based baselines, improving performance by 2.92-17.92% and 6.67-17.78%, respectively. Project page: https://modex2026.github.io/.
comment: Submitted to CoRL 2026
VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents
Reusable robot skills are becoming the basic units through which embodied agents turn open-ended instructions into long-horizon physical behavior. We argue that, while foundation models have collapsed the cost of creating these skills, the cost of trusting them has not. Existing skill-evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self-critique, but these signals provide only trace-level evidence: they show that a skill worked on sampled executions, not that skill-induced plans satisfy temporal safety contracts under untested conditions. We introduce VASO, a framework for verification-guided self-evolution of LLM-generated robot skill contracts. In VASO, each skill is represented as a semantic contract with two coupled interfaces: a formal interface that aligns robot states, observations, and control commands with logical propositions for model checking, and a planner-facing interface that guides executable behavior generation. A model checker first filters logically inconsistent skill contracts, then verifies plans induced by the skill against global and local temporal specifications. When verification fails, VASO translates the counterexample trace into a textual gradient that updates the reusable skill contract while keeping foundation-model weights frozen. On Clearpath Jackal and PX4 quadcopter tasks, VASO reaches 97.2% formal-specification compliance using fewer than 100 optimization samples, outperforming execution-feedback, prompt-optimization, and fine-tuning baselines. To our knowledge, VASO is the first framework that closes the loop between formal verification and self-evolving LLM-generated skills for physical AI agents: formal counterexamples become optimization feedback for reusable robot skill contracts, rather than merely verifying one-off plans, tuning planner prompts, or fine-tuning model weights.
comment: Project webpage: https://languagegroundedriskdetection.github.io/ProjectPage/vaso-webpage/
Efficient Computation of Distance Functions for Navigation Vector Fields in Lie Groups
Vector-field-based methods are widely used for robot control and are often applied to the path-tracking problem. Some vector field approaches require repeatedly computing the distance between the robot configuration and the curve, as well as the corresponding closest point. Recently, vector fields have been extended to Lie Groups. In this case, this computation can be expensive, especially when performed at high control frequencies on embedded platforms. This paper proposes a method for efficiently computing the distance between a point and a curve represented as what is called a G-polynomial curve, which is a curve representation that generalizes polynomial curves to matrix Lie groups. The proposed approach exploits the structure of these curves to reduce the problem to a small number of polynomial root-finding computations. Simulation results show that the method significantly reduces computation time while maintaining accuracy compared to existing optimization-based approaches. Practical formulas are also provided for the case of the group SE(3), and the method is validated experimentally on a robotic manipulator. The methodology is implemented in a computational package, available online.
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors
Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84\% real-world success on diverse object pick-up and 90\% success on stair-climbing.
comment: Project page: https://research.nvidia.com/labs/dair/grail/
X4Val: Learning Neural Surrogates for Variance-Reduced Policy Evaluation
Rigorous evaluation of learning-based robotic systems is an essential prerequisite for deployment. However, real-world test data is expensive to gather; moreover, in a typical iterative development context, data gathered from the latest policy is necessarily limited in scale. This motivates evaluation methodologies that make use of heterogeneous data sources, including simulation, historical policy logs, and data collected from related platforms or environments. While such auxiliary data are abundant and inexpensive, they are generally not directly representative of real-world outcomes -- for example, performance in simulation may differ substantially from performance in the real world -- making their principled use for high-confidence performance estimation challenging. In this paper, we introduce X4Val, a general framework for variance-reduced real-world metric estimation in the presence of non-paired, multi-domain data. X4Val embeds samples from real and auxiliary domains into a shared representation space and learns a transferable predictor of real-world metrics; this learned predictor is then incorporated into a control-variates estimator, enabling variance reduction even when paired samples are unavailable. We provide theoretical analysis and empirical evaluations on autonomous driving and real-world robot manipulation tasks, domains across which X4Val achieves up to 38.4% variance reduction and demonstrates consistent improvements over strong baselines. These results show that non-paired, heterogeneous data can be leveraged to substantially improve the sample efficiency of rigorous robotic system validation.
HORIZON: Recoverability-Governed Curriculum for Physical-Domain Scaling
Scaling robust robot policies requires more than broader randomization, because physical-domain experience must remain organized and learnable throughout training. We study when a policy can benefit from harder physics and identify recoverability as a central constraint in on-policy physical-domain scaling. In on-policy training, new dynamics are useful only insofar as they remain close enough to the current policy to generate corrective on-policy data, rather than collapsing rollouts into unrecoverable failures. Using quadruped locomotion as a physically demanding benchmark for embodied generalization, we introduce HORIZON, a checkpointed frontier curriculum that expands physical domains only within the current policy's recoverable boundary. HORIZON uses rollback and boundary refinement to govern each expansion step, turning fixed randomization into a continual process of physical-domain growth. Experiments reveal three regularities of physical-domain expansion. First, direct domain widening is uneven across physical axes and often unlearnable without staged ordering. Second, domain composition is non-monotonic, and adding more domains beyond a compact core can dilute recoverable joint samples and reduce overall robustness. Third, offline distillation of isolated experts cannot substitute for the joint interaction generated by on-policy curriculum. Together, these results frame physical-domain generalization as a continual growth problem for embodied control, with recoverability as the organizing principle for on-policy expansion.
comment: 16 pages, 9 figures
Generalization of World Models under Environmental Variability for Vision-based Quadrotor Navigation
World models, learned generative models that predict how an environment evolves, have become a promising tool for sample-efficient robot learning. Yet how robust they are to environmental variability remains poorly understood. To address this, we conduct a systematic study using vision-based quadrotor navigation as a testbed problem, training DreamerV3-based world models under varying levels of environmental randomness and evaluating them across all levels through cross-environment validation, spanning both Self-Supervised Learning (SSL) pretraining and Reinforcement Learning (RL) fine-tuning. We then deploy all world models and associated navigation policies on a real quadrotor in unseen environments, including an open-loop run where the model receives just 2.5s of real sensory input before all sensors are cut off, leaving the system to navigate entirely in imagination over a 12m traverse. Our results show that world model robustness during SSL pretraining is a strong predictor of sim-to-real transfer: every model that generalized well in cross-environment SSL validation deployed successfully in the real world, passing through gaps as narrow as 0.67m, whereas the model that dominated simulation policy evaluation failed on the real platform. We further identify (a) the discrete latent size and (b) the training-sequence length as the dominant factors governing world model quality.
CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation
Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.
comment: 16 pages, 5 figures
Flash-WAM: Modality-Aware Distillation for World Action Models
World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce \textbf{Flash-WAM}, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from $8.1$ seconds to $348$ ms on NVIDIA L40S, a $23{\times}$ speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks ($85.5\%$ RoboTwin 2.0, $95.7\%$ LIBERO) and substantially recovers real-world performance ($60\%$ average on a Unitree G1 humanoid robot), while naive consistency distillation drops to $24\%$ at the same step budget.
What Can Eye Gaze Teach Us About Real-World Cycling? Insights From the Oxford RobotCycle Project
Although much is known about the physical danger of cycling situations, less is understood about the perceived danger of cycling. Furthermore, perception of danger may be filtered at a subconscious level and therefore difficult for one to self-report. To this end, these subconscious perceptions can be revealed through physiological metrics such as eye gaze. This paper explores the perceived safety of cycling in Oxford, United Kingdom and explores the ability of wearable eye tracking glasses to produce insights about the differences in perception under different environments and events. This paper finds that eye gaze patterns change between using bike lanes, car lanes and shared bus lanes, representing different cognitive challenges of each lane type. This paper presents that different intersections have significantly different eye gaze patterns which may have implications for cyclist stress. Finally, eye gaze patterns differ in the presence of events such as passes and pedestrians in the road compared to when cycling with no events. This paper draws conclusions on the benefits and limitations of using wearable eye trackers to estimate stress and cyclist workload.
Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement
Large vision-language-action (VLA) policies are increasingly trained as conditional generative models over action chunks. Yet deployment produces mixed-quality experience-successful demonstrations, partial completions, recoverable mistakes, and failures-that is difficult to use with standard imitation. Full behavior cloning (BC) imitates failures, filtered BC discards useful sub-trajectories, and offline reinforcement learning adds a large critic. We introduce ForesightFlow, a self-guided flow-matching policy that augments each generated action chunk with a learned success-potential trajectory. The same flow proposes and scores candidate actions, enabling best-of-$K$ inference without an external critic. The key issue is that policy improvement and value calibration require different supervision: advantage weighting should emphasize high-quality actions, but applying the same weights to potential coordinates suppresses failure gradients and creates overconfident scores. We address this with decoupled advantage-weighted flow matching, applying exponentiated advantage weights only to action velocities while training potential velocities uniformly. We further derive a one-step boundary estimator for conditional flow matching, allowing advantage computation with a single stop-gradient forward pass. Across five BEHAVIOR-1K simulation tasks and five real-world bimanual tasks, ForesightFlow improves over imitation baselines, matches the strongest separate-critic baseline in simulation success, improves real-world success, and reduces training compute by $38\%$. Ablations show that decoupling prevents value hallucination, the one-step estimator preserves candidate-ranking fidelity, and self-guided sampling improves long-horizon execution.
WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation
Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.
D$^3$-MoE:Dual Disentangled Diffusion Mixture-of-Experts for Style-Controllable End-to-End Autonomous Driving
Traditional end-to-end autonomous driving frameworks frequently suffer from the "style-averaging" dilemma when trained on high-variance human demonstrations, yielding homogenized, style-uncontrollable, and even kinematically unsafe policies. To overcome this limitation, we present D$^3$-MoE (Dual Disentangled Diffusion Mixture-of-Experts), which disentangles trajectory modeling along two complementary axes. On the behavioral axis, generation is decoupled from selection: a style-conditioned diffusion process synthesizes multi-style candidate trajectories in parallel within a single scene, allowing a downstream module to select the optimal trajectory based on user preference or an evaluation score. On the physical axis, decoupled longitudinal and lateral routers activate their respective experts during inference time, trained without manual labels using self-supervised targets from orthogonal ground-truth kinematics. These activated experts, architected as Diffusion Transformers (DiT) and equipped with style-conditioned AdaLN and asymmetric lateral-fusion cross-attention, independently predict their corresponding physical state before being reassembled into a unified, kinematically coherent trajectory. Extensive evaluations on the challenging NAVSIM benchmark demonstrate that D$^3$-MoE achieves state-of-the-art planning performance, reaching 88.2 PDMS and 84.3 EPDMS by default. Moreover, our Best-of-Three ensemble strategy effectively broadens the multi-modal solution space, raising performance to 91.3 PDMS and 87.5 EPDMS. Both quantitative and qualitative analyses jointly confirm the framework's advantages in planning quality and style controllability.
comment: 8 pages, 6 figures
Teaching Robots to Say 'I Don't Know' : SENTINEL for Uncertainty-Aware SLAM ICRA 2026
Low-cost 2D LiDARs lack the intensity channel that higher-end sensors use to diagnose measurement failures, yet they are widely used on educational and budget robotics platforms. We present SENTINEL, a training - free, label - free reliability estimation framework that gives range - only LiDAR an effective diagnostic signal. SENTINEL combines geometry-based scan statistics with cross - modal depth consistency between LiDAR and an RGB - D camera to compute a per - scan reliability score between 0 and 1. When the score falls below a threshold, corrupted scans are rejected and the robot falls back to calibrated wheel odometry, preventing silent SLAM corruption. We evaluate SENTINEL on a GEFIER R1 four - wheel skid-steer robot equipped with an RPLidar A2M12 and an Intel RealSense D435i in a 185 cm by 245 cm arena containing controlled transparent and reflective failure elements on a central obstacle. Spatial reliability maps across five surface conditions, including glass, mirror, shiny paper, and a mixed mirror and shiny-paper condition, show clear separation between clean and failure cases, allowing affected regions to be identified as reject or noise. Because these failure modes are absent in simulation, validation is performed entirely on real hardware.
comment: 6 pages, 10 figures, 3 tables, This paper was accepted at Uncertainty in Open-World Robotics Workshop in conjunction with Internation conference of robotics and automation (ICRA 2026)
M3imic: Learning a Versatile Whole-Body Controller for Multimodal Motion Mimicking
Building a general-purpose whole-body controller is essential for enabling diverse motion capabilities in humanoid robots across a wide range of downstream tasks, including locomotion and loco-manipulation. Different tasks rely on distinct motion reference modalities: locomotion primarily depends on coordinated robot joint trajectories, whereas manipulation requires precise end-effector trajectory tracking. Existing methods often overlook the representational mismatch between dense robot joint angles and sparse end-effector poses. To address this, we propose Multi-Modal Mimic (M3imic), a versatile multi-modal whole-body control framework that unifies heterogeneous motion reference modalities, including robot joint angles, human pose trajectories, and end-effector poses, using modality-specific encoders to map them into a shared latent space. Leveraging large-scale reinforcement learning in the simulator, we train a single policy that achieves sim-to-real transfer across multiple motion reference modalities without modality-specific retraining. Extensive simulation and real-world experiments on the Unitree G1 robot are conducted to evaluate the proposed framework. In simulation, the policy achieves a peak success rate of 98.42\% on an unseen test dataset, demonstrating its exceptional generalization capability. The code is available at https://github.com/Renforce-Dynamics/MultiModalWBC
HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning
Despite the importance of tactile sensing for reliable manipulation, most existing Vision-Language-Action (VLA) datasets remain vision-only, and those that do incorporate tactile information typically lack the joint combination of task diversity, language conditioning, and action trajectories. Furthermore, existing teleoperation pipelines rarely provide haptic feedback to the operator, despite its established role in demonstration quality and manipulation stability. In this work, we present HapTile, a contact-grounded visuotactile manipulation dataset that advances beyond vision-only trajectory datasets by embedding physical interaction sensing at two levels: fingertip tactile feedback at the robot end-effector, and haptic-informed demonstrations at the teleoperator side. The data collection platform integrates haptic feedback directly into the teleoperation controller, enabling the operator to perceive contact interactions in real time. It is built around a standard and reproducible robotic system equipped with custom-designed fingertip tactile sensors. The dataset comprises everyday manipulation tasks spanning a broad range of contact-rich skills, including pick-and-place, folding, pressing, stacking, and other routine activities. Each task is paired with language instructions that condition the policy on the manipulation objective, together with synchronized visuotactile observations and action trajectories. In addition, we provide a benchmarking study on contact-rich policy learning using two baseline models to evaluate the effectiveness of the proposed contact-grounded dataset. The dataset and additional details are available on our website: haptile-dataset.github.io.
Real-World Deployment of a 5G-Connected Edge-Controlled Aerial Robot in Industrial Subterranean Mines
This article presents the first real-world autonomous flight of a 5G-connected aerial robot controlled by an edge-offloaded controller, and aims to bridge the gap between controlled and factual setups. The robot operates within an active industrial subterranean mine, while the high-level controller is deployed in a nearby Kubernetes-based edge cluster. Communication between the robot and the edge is enabled via a 5G New Radio (NR) Standalone (SA) network. The chosen controller is a Model Predictive Controller (MPC), which generates control actions to allow the robot to navigate seamlessly through the mining environment. A human operator selects waypoints for the aerial robot, and the MPC generates smooth, collision-free paths for autonomous executions. The proposed 5G edge-based closed-loop system is evaluated in a real industrial setting and demonstrates the potential of edge-controlled robotic systems toward time-critical, safe and efficient future deployments.
comment: 6 pages, 8 figures, MED 2026
Inverse Manipulation through Symbolic Planning and Residual Operator Learning
Inverting a robotic task requires more than reversing symbolic state transitions or rewinding motor trajectories. In robot manipulation tasks, symbolic inverse plans often fail to fully restore the effects of forward executions under continuous interaction dynamics. We present a hybrid framework for inverse manipulation that derives inverse-skill objectives from STRIPS-like operators automatically extracted from demonstrations through soft geometric predicates. For each extracted operator, we construct an inverse restoration objective that preserves preconditions, restores delete effects, and negates add effects. A task planner first attempts to satisfy this objective using available action primitives. Unresolved symbolic predicates then induce a residual operator learning problem solved through Reinforcement Learning (RL). We evaluate the framework on the ManiSkill3 PushCube task. For a forward pushing skill, the symbolic inverse performs a coarse pick-and-place restoration, while a residual Soft Actor-Critic policy refines the cube pose to satisfy the remaining inverse predicates. Our results show that predicate-derived residual control can turn an approximate symbolic inverse into a physically grounded inverse skill.
comment: To be presented in PlanRob26
Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives
Visual localization -- estimating a camera pose within a pre-existing map -- is a fundamental problem in computer vision. Floorplans are an attractive map representation: they are readily available for most buildings, compact, and inherently invariant to visual appearance changes. However, bridging the severe domain gap between camera observations and floorplan geometry remains challenging. Existing methods address this gap through data-driven learning, yet they require large-scale training data and environment-specific retraining, limiting their practical deployment. We propose a zero-shot floorplan localization method that generalizes to novel environments without any retraining. Our key insight is that dominant geometric primitives -- lines and circles -- are ubiquitous in human-made environments and provide appearance-invariant structural constraints. We extract these primitives from a bird's-eye-view (BEV) projection of monocular 3D reconstructions and match them to the floorplan via dedicated minimal solvers within a robust estimation framework. Experiments on both simulated and real-world datasets show that our approach outperforms state-of-the-art learning-based methods on unseen environments, while using a single fixed set of hyperparameters across all experiments. The source code will be made publicly available.
SoftPINCH: EMG-Driven Soft Exoskeleton Assistance for Finger Flexion and Grasping
Surface electromyography (sEMG) provides a non-invasive interface for detecting hand-movement intention and controlling wearable assistive devices. However, reliable EMG-driven hand assistance remains challenging because EMG signals are affected by noise, motion artifacts, electrode placement, muscle fatigue, and inter-subject variability. At the same time, many hand exoskeletons remain mechanically restrictive or bulky, limiting comfort and natural hand motion. This work presents SoftPINCH, an EMG-driven soft wearable exoskeleton for thumb-index finger flexion and pinch grasp assistance. The system combines a tendon-driven soft exoskeleton, fingertip magnetic contact sensing, and neural EMG decoding for intention-based assistance. Surface EMG was recorded from forearm muscles during index and thumb movements, and three subject-independent decoding architectures were evaluated: LSTM, CNN+LSTM, and CNN+LSTM with attention. The CNN+LSTM and CNN+LSTM-attention models both achieved 99.4% LOSO test accuracy, outperforming the standalone LSTM, which reached 97.8%. However, the attention mechanism did not provide a significant improvement over CNN+LSTM, indicating that CNN-based feature extraction was sufficient for robust EMG representation. The CNN+LSTM model was therefore selected for real-time deployment due to its high accuracy and lower architectural complexity. Functional evaluation showed that active exoskeleton assistance reduced muscular effort during isolated finger flexion and object grasping. During weighted grasping, assistance reduced muscular effort across all tested loads, with a 92.6% reduction at the highest load. These results demonstrate the potential of SoftPINCH for intuitive, low-effort pinch assistance using real-time EMG-driven soft robotic control.
comment: Submitted to 18th International Conference on the Simulation of Adaptive Behavior (SAB 2026)
COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky-Ordered Projection
Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled independently for each objective. This objective-wise treatment neglects inter-objective correlation and can lead to overly conservative value estimates, thereby reducing sample efficiency. To address this issue, we propose Cholesky-Ordered Projection Q-learning (COP-Q), a safety-first method that incorporates inter-objective covariance into vector-valued Q-value estimation. COP-Q constructs a generalized confidence bound in the joint Q-value space and uses Cholesky factorization to encode objective priority in a sequential form. This preserves conservatism on safety while adaptively reducing excessive conservatism on the reward objective. The resulting estimate is used in both temporal-difference target computation and actor optimization. COP-Q incurs minimal computational overhead and is readily compatible with most existing deep Q-learning frameworks. Experiments on robot locomotion in Brax and safe navigation in Safety-Gymnasium, covering both hard- and soft-safety settings, demonstrate that COP-Q achieves strong safety performance together with competitive or improved sample efficiency relative to representative baselines.
comment: 7 pages, 6 figures, 2 tables
CADENCE: Predicting Realized MAPF Execution Time Beyond Sum of Costs ICRA 2026
Multi-Agent Path Finding (MAPF) algorithms are increasingly used to plan motion for robot teams in industrial warehouses and robotic shared workspaces, but standard MAPF algorithm evaluation metrics, such as Sum of Costs (SoC), makespan, and planner runtime, can obscure how planner choices translate into realistic execution performance. We present CADENCE (Coordination and Action-Driven Estimation for Networked Continuous Execution), a hardware study of this evaluation gap on a fixed 7 by 7 workcell with seven differential drive robots, asking which features available before execution can best predict final wall-clock completion time. We compare SoC, total planned travel cost, primitive motion burden (how much basic motion the plan requires, such as makespan, turns, consecutive moves, and start-stop transitions), and interaction aware coordination structure (how much inter-robot coordination the plan induces, such as dependency links, interacting robot pairs, dependency depth, and crowding exposure). To test this, we generate 120 plans across 15 scenarios -- 5 Empty, 5 Medium Random, and 5 Bottleneck and execute each plan four times, yielding a 480 trial hardware corpus. Using both a scenario-held -- out ridge model and a trial-level mixed-effects model, we find that SoC alone is informative but incomplete, while primitive motion burden gives the strongest improvement, reducing held out error by about 48.6%-59.8% in MAE and 44.2%-61.4% in RMSE relative to SoC-only models. Interaction-aware coordination features add smaller, less uniform gains, most clearly in the mixed-effects analysis. Across both models and uncertainty checks, primitive motion burden is the most reliable additional signal beyond SoC, suggesting that much of the execution time gap is already visible in the offline plan before any robot starts moving.
comment: 7 pages, 4 figures, 3 tables and this paper was accepted at Multi-Agent Robotic Systems: Real-World Collaboration and Interaction a workshop at the international conference of robotics and automation (ICRA 2026)
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation
Humans primarily rely on walking and running to traverse complex terrains, without resorting to unnecessarily complex motion patterns. Similarly, humanoid robots should achieve smooth transitions between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference and the distribution shift induced by terrain-dependent visual and dynamic variations. Although Mixture-of-Experts (MoE) architectures can alleviate multi-skill interference, naive joint training often fails to yield clear expert specialization, limiting their effectiveness. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced and trained with a contrastive objective to shape the gating network, enabling it to capture structured terrain representations and promote expert specialization. The final action is obtained via weighted fusion of the base gait policy and the terrain-aware branch, allowing the policy to preserve stable locomotion patterns while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains, while maintaining accurate foothold placement and dynamic stability under external disturbances.
comment: Kailun Huang, Zikang Xie, Yanzhe Xie and Panpan Liao contributed equally to this work. Corresponding authors: Renjing Xu and Haohui Huang
BPDA-GMM: Bayesian Probabilistic Data Association via Gaussian Mixture Models for Semantic SLAM
Probabilistic data association (PDA) improves semantic SLAM in perceptually aliased scenes, but existing methods often assume a fixed landmark set, recompute association weights as the map grows, or rely on hand-tuned null-hypothesis weights. To address these limitations, we propose \textbf{BPDA-GMM}, an online Bayesian PDA framework for semantic SLAM with a growing object-level map. BPDA-GMM uses a Dirichlet-process prior to induce a Chinese Restaurant Process (CRP) association model, where accumulated evidence favors existing landmarks, and the concentration parameter assigns probability mass to new landmarks. For each semantic detection, plausible candidates are selected by a joint semantic-geometric gate, CRP-weighted association probabilities are computed, and object landmarks are updated as semantic Gaussians in closed form. The resulting landmark set forms a Gaussian mixture model, and its dominant component is passed to the back-end as a max-mixture semantic factor. When association weights are inconclusive, an ambiguity-triggered $α$-divergence tempering step improves discrimination. Finally, a decoupled back-end zeroes the pose Jacobian of semantic factors, allowing noisy detections to refine landmarks without directly perturbing the trajectory. Experiments in simulation and on a real indoor dataset demonstrate improved trajectory accuracy, semantic mapping quality, and robustness to perceptual aliasing and classifier errors over state-of-the-art baselines. Code and video are publicly available at https://github.com/thanhnguyencanh/BPDA-SLAM.
MineXplore: An Open-Source Reinforcement Learning Exploration Benchmark for GNSS-Denied Underground Environment ICRA
Underground mines present extreme conditions for autonomous robot navigation: GPS is denied, lighting is degraded, and tunnel topology is loop-rich and non-convex. Simulation benchmarks grounded in real production-mine geometry and compatible with GPU-accelerated learning pipelines do not yet exist in the open-source ecosystem. We present MineXplore, an open-source MuJoCo-based navigation benchmark derived from the Leung et al. 2017 Chilean underground copper mine dataset. The environment reconstructs a 104,423 sq.m tunnel network through an six-stage contour-to-MJCF pipeline incorporating octagonal wall cross-sections, LiDAR-sourced jagged wall geometry, three terrain friction zones, a global 5 degree incline, and periodic spot lighting. Geometric fidelity is validated at an Intersection over Union (IoU) of 0.9538 against the source survey map, and surface texture similarity scores 79.4% across six structural dimensions. A single-agent PPO baseline trained via RLlib across five independent random seeds achieves a best rolling coverage of 88.89% (3 of 5 seeds reaching the 90% coverage target), confirming that MineXplore supports stable and reproducible policy learning under realistic underground sensing and topology.
comment: 7 pages,11 figures, Submitted to the workshop Xplore:Cross-Disciplinary aspects of Exploration in Robotics, Reinforcement Learning and Search Held at International Conference on Robotics and Automation (ICRA)
MAD: Mapping-Aware World Models for Agile Quadrotor Flight
Agile quadrotor flight in cluttered scenes requires more than a reactive mapping from a depth image to a control command: the vehicle must remember which regions have been observed, infer nearby occupied space, and act under partial visibility and tight latency. In this paper, we present Mapping-Aware Dreamer (MAD), a geometry-aware world model for vision-based quadrotor flight. Instead of using raw-image reconstruction as the main self-supervised objective, MAD learns recurrent latent dynamics that reconstruct robocentric occupancy and visibility grid maps together with proprioceptive states. This design forces the latent state to encode local geometry, visibility history, and ego-motion in a form that is directly relevant to collision avoidance. MAD is trained in DiffAero using a GPU-parallel map-construction module that provides high-throughput supervision for occupancy and visibility. The learned representation is used in three policy-learning modes: imagination-based MAD-Dreamer and feature-extractor variants based on PPO and SHAC. Across visual navigation and racing tasks, MAD-based agents achieve higher success rates, faster flight, and better cross-task transfer than corresponding vision-only baselines. The model also produces interpretable map predictions and accurate ego-motion estimates from depth observations. We further deploy the learned policy on a physical quadrotor with an Intel RealSense D435i and demonstrate safe indoor and outdoor flight under limited sensing, reaching 9.66 m/s in simulation and 5.05 m/s in real-world forest experiments. These results show that mapping-aware world models provide a practical middle ground between modular aerial navigation and end-to-end learning.
comment: 12 pages, 14 figures
Cooperative Circumnavigation for Multiple Unmanned Surface Vehicles Without External Localization
This paper proposes a cooperative target circumnavigation framework for multiple unmanned surface vehicles (USVs) operating without external localization. The objective is to maintain a uniform circular formation of a specified radius around a target using only limited onboard sensing. The framework adopts a heterogeneous perception strategy that distinguishes between the asymmetric sensing relationships with the target and among the USVs. Specifically, the USVs obtain relative range and displacement measurements through active perception and inter-vehicle communication, while bearing measurements to a non-cooperative target are acquired via passive sensors. To estimate relative positions--both among USVs and between each USV and the target--we employ a Maximum Correntropy Kalman Filter and a Pseudo-Linear Kalman Filter, respectively. A coupled oscillator-based formation controller is designed to ensure system observability while achieving circumnavigation. Theoretical analysis demonstrates that the controller ensures the relative motions between the USVs, as well as that between each USV and the target, satisfy the persistent excitation condition, thereby guaranteeing observability of the Kalman-based filters. The effectiveness of the proposed approach is validated through numerical simulations.
comment: 17 pages, 15 figures
TransTac: Visuo-Tactile Modality Transition via Ultraviolet-Encoded Transparent Elastomers ICRA
Vision-based tactile sensors (VBTS) recover high-resolution contact geometry but typically rely on opaque elastomer layers that prevent visual transparency, while RGB-D cameras provide global depth perception yet degrade significantly at close range. To address this limitation, we present TransTac, a transparent ultraviolet (UV)-encoded binocular VBTS that integrates visual observation and marker-based tactile reconstruction within a single compact device. The system employs a transparent elastomer embedded with UV-reflective markers and a prior-guided Delaunay stereo matching algorithm for robust sparse triangulation. To reliably detect densely distributed semitransparent markers, we develop a lightweight detector that enables stable localization under contact and deformation. The proposed prior-guided Delaunay matching improves correspondence robustness by approximately 21% compared with global assignment baselines while maintaining high reconstruction accuracy. In semantic evaluation, TransTac achieves up to 83.3% zero-shot recognition accuracy on tactile images, exceeding opaque tactile baselines by approximately 50 percentage points. Embedding analysis further reveals substantially stronger cross-modal alignment with natural images, with class-center similarity increasing from around 0.2 to over 0.77. Controlled near-distance experiments quantify the degradation of RGB-D depth reliability and demonstrate extended geometric coverage enabled by visuo-tactile integration. Finally, a compact prototype is implemented with an approximate hardware cost of $70.
comment: Accepted at IEEE International Conference on Robotics and Automation (ICRA) 2026. 8 pages, 7 figures
3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training
We propose a 3D-thinking-guided co-training framework that enables vision-language-action (VLA) models to perform 3D spatial reasoning implicitly during action prediction. Our core insight is that 3D geometry perception and 3D spatial reasoning are distinct capabilities that can be disentangled and injected at different feature hierarchies. During training, three tightly coupled components work in concert primarily within the latent space: (1) To gain geometric priors, a latent 3D geometry perception module aligns intermediate visual features with a 3D foundation model, acquiring low-level geometric cues without architectural modifications to the VLM backbone. (2) Complementing this, an online 3D reasoning distillation module mitigates the prompt-induced reasoning gap via a shared reasoning anchor token. During 3D VLM co-training, this anchor is emitted as the first output token to robustly encode spatial priors. During VLA training, it serves as an input token inserted between the task and action instructions, transferring high-level spatial thinking from explicit teacher reasoning prompts to student action prompts without chain-of-thought text generation. (3) These disentangled geometric and reasoning features are then united by a spatially augmented action integration, which jointly injects them into the action-query tokens as hierarchical spatial conditions to prevent action shortcuts. At deployment, our method retains only its lightweight adapters to perform implicit 3D reasoning, discarding the 3D foundation model and the teacher branch used for supervision. Consequently, it operates purely on 2D images without 3D sensors, external models, or explicit text generation while preventing catastrophic forgetting of the pretrained VLM, achieving state-of-the-art performance on LIBERO, LIBERO-PLUS, SimplerEnv, and real-world manipulation tasks.
A New Quaternion-Joint Cable-Driven Redundant Manipulator Configuration and its Control Through FABRIK and Residual Reinforcement Learning
Robotic arms capable of traversing arbitrary spatial paths, especially in highly obstructed workspaces, are highly desired across several industries. Quaternion-joints have recently empowered a specific class of robotic arms -- cable-driven redundant manipulators -- beyond its prior capabilities. Specifically, quaternion-joints reduce the number of required motors per degree of freedom, paving the way for more compact solutions.An ongoing challenge is that the complexity of the kinematic model of quaternion joints challenges a priori decisions on manipulator configurations and imposes higher computational demands on the control system and its non-linearities amplify all discrepancies between design and physical artifact arising from fabrication imprecision. Here we show a that a 4-segment, 8-joint manipulator can achieve a broader workspace than extant configurations, at lower hardware cost, and that Residual Reinforcement Learning outperforms extant state-of-the-art methods -- specifically, the FABRIK algorithm -- on the control of such manipulator. Our results show that this configuration is more workspace-effective than prior designs, and that Residual Reinforcement Learning outperforms FABRIK by three orders of magnitude on positional and orientational accuracy, effecting precise control of the novel 4-segment, 8-joint manipulator. Additionally, the control implementation is simpler: we describe the complete FABRIK process for control and corresponding learning implementation. Our methodology is applicable to the design of new systems, providing designers with further tools for the development of this class of manipulators and corresponding control systems for novel configurations.
When Freshness Is Not Enough: Distribution-Aware Age of Information for Networked LQR Control
Age of Information (AoI) has become a central metric for the design of wireless update systems, especially in applications where fresh measurements support tracking, estimation, and control. Despite its popularity, the use of mean AoI or peak AoI as a surrogate for closed-loop performance is often motivated by intuition rather than by a control-theoretic derivation. This paper examines whether minimizing the mean AoI is in fact optimal for networked control systems. For scalar linear time-invariant systems with delayed intermittent updates, we show that, under state-independent scheduling policies, the infinite-horizon LQR tracking problem reduces to an optimization over the distribution of inter-scheduling intervals. The resulting objective depends on higher-order statistical moments, and in unstable or correlated regimes on exponential moments, of the inter-scheduling process rather than only on its mean. Consequently, policies with identical mean AoI can induce substantially different tracking costs. We further extend the analysis to disturbances with exponentially decaying autocorrelation and derive equivalent cost formulations that expose the role of the full interval distribution. Finally, we validate the theory using real vehicle trajectories from the NGSIM US-101 dataset. The empirical results match the predicted performance trends, demonstrating that mean AoI alone is insufficient for control-oriented network design.
Think Fast and Far: Long-Horizon Online POMDP Planning via Rapid State Sampling
Partially Observable Markov Decision Processes (POMDPs) are a general and principled framework for motion planning under uncertainty. Despite tremendous improvement in the scalability of POMDP solvers, long-horizon POMDPs remain difficult to solve. To alleviate the difficulty, this paper proposes a new approximate online POMDP solver, called Reference-Based Online POMDP Planning via Rapid State Space Sampling (ROP-RAS3). ROP-RAS3 uses novel extremely fast sampling-based motion planning techniques to sample the state space and generate a diverse set of macro actions online, which are then used to bias belief-space sampling and infer high-quality policies without requiring exhaustive enumeration of the action space -- a fundamental constraint for modern online POMDP solvers. ROP-RAS3 converges to a near-optimal reference-based solution at a rate that depends on the number of sampled actions, rather than the size of the action space. ROP-RAS3 is evaluated on various long-horizon POMDPs with up to 3000 lookahead steps and 35-dimensional state spaces, where the state, action and observation spaces can be continuous, discrete, or a hybrid of discrete and continuous. Although the reference-based optimal solution may not be the same as the optimal POMDP solution, empirical results indicate that in all of these problems, in terms of success rate, ROP-RAS3 outperforms other state-of-the-art methods by up to multiple folds. We also demonstrate the capability of our approach on a physical robot demonstration. This work extends the theory and empirical results of our ISRR24 paper. Code can be found at \texttt{https://github.com/RDLLab/ROPRAS3}.
comment: @inproceedings{Liang2026Thinking, title = {Think Fast and Far: Long-Horizon Online POMDP Planning via Rapid State Sampling}, author = {Yuanchu Liang and Edward Kim and J.Arden Knoll and Wil Thomason and Zachary Kingston and Lydia E. Kavraki and Hanna Kurniawati}, year = 2026, booktitle = {International Journal of Robotics Research (to appear)} }
OLIVE: Online Low-Rank Incremental Learning for Efficient Adaptive Exoskeletons
Wearable exoskeleton systems hold promise for restoring mobility in individuals with physical impairments, yet most existing controllers rely on static gait policies that lack the ability to adapt to dynamic real-world environments or individual user characteristics. We present \olive (\underline{O}nline \underline{L}ow-rank \underline{I}ncremental Learning for Efficient Adapti\underline{ve} Exoskeletons), a parameter-efficient online adaptation framework that continuously personalizes exoskeleton control during deployment. \olive decomposes the adaptive component of the control policy into a low-rank residual form~$\dW = \At\Bt^\top$ with rank~$r!\ll!\min(d,k)$, reducing online update cost from $\mathcal{O}(dk)$ to $\mathcal{O}(r(d{+}k))$ while preserving the stability of a pretrained base controller~$\Wz$. Parameters are updated via a reward-shaped policy gradient driven purely by on-body sensor feedback (EMG, IMU, vibration), eliminating dependence on offline reference trajectories. A gating mechanism modulates the strength of personalization based on contextual state, and a dynamic rank scheduler adapts the update dimensionality to terrain complexity -- allocating minimal capacity on simple flat terrain and expanding to higher-rank updates on demanding uneven surfaces -- enabling robust performance across diverse activities: flat walking, stair navigation, slopes, and uneven terrain. Experiments on the wearable platform demonstrate that \olive achieves +13, +22, and +15 percentage-point improvements in gait smoothness, effort reduction, and motion stability over the strongest baseline, converging within $\sim$1{,}800 walking steps at 7.4,ms end-to-end latency. Our code implementation is available at https://github.com/FastLM/OLIVE.
Safe and Energy-Aware Multi-Robot Density Control via PDE-Constrained Optimization for Long-Duration Autonomy
This paper presents a novel density control framework for multi-robot systems with spatial safety and energy sustainability guarantees. Stochastic robot motion is encoded through the Fokker-Planck Partial Differential Equation (PDE) at the density level. Control Lyapunov and control barrier functions are integrated with PDEs to enforce target density tracking, obstacle region avoidance, and energy sufficiency over multiple charging cycles. The resulting quadratic program enables fast in-the-loop implementation that adjusts commands in real-time. Multi-robot experiment and extensive simulations were conducted to demonstrate the effectiveness of the controller under localization and motion uncertainties.
HERO: Learning Humanoid End-Effector Control for Visual Whole-Body Open-Vocabulary Object Grasping
Visual loco-manipulation of arbitrary in-the-wild objects requires accurate end-effector (EE) control and a generalizable understanding of the scene from visual inputs (eg, RGB-D images). Existing imitation and sim2real methods jointly learn both these aspects via monolithic end-to-end learning and are thus hard to scale. In this work, we bring to bear the best tools for each of these problems -- large vision models for generalizable scene understanding and simulated training for accurate EE control -- leading to an overall modular loco-manipulation system that exhibits strong generalization. Our core technical innovation is HERO, an accurate residual-aware EE tracking policy made possible by combining classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, and c) goal adjustment and replanning. Together, these innovations reduce the end-effector tracking error to 2.44cm, outperforming the strongest prior method by 5.5x. Our overall system operates in diverse real-world environments, from offices to coffee shops, where the robot reliably grasps various everyday objects (eg, mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests demonstrate the effectiveness of our proposed design. We believe our advances open up new ways of training humanoids to interact with daily objects.
comment: Project page: https://hero-humanoid.github.io/
Worth Remembering: Surprise-Gated Robot Episodic Memory
Robots solving generalist tasks need to be able to ground instructions in their past experience, since humans may refer to notable past events when giving a task (e.g., ``Take me to where the chemical spill happened yesterday''). Since memory limits make storing all past events infeasible, long-term robot memory must be selective, ideally retaining only those episodes with high utility for future tasks. However, future tasks are not typically given a priori for generalist robots. To select generically useful memories, we propose Bayesian surprise as a gating mechanism for memory formation. We present an approach to compute surprise in a semantically rich deployment-agnostic latent space provided by V-JEPA-2. Using our gated episodic memory to augment 4D scene graph-based spatial memory, we show a consistent improvement over state-of-the-art benchmarks in robot question answering, outperforming prior robot memory methods by $\geq12\%$ for temporal, spatial, and binary questions, and surpassing the performance of supervised and non-causal methods with an unsupervised causal method in event segmentation tasks.
comment: 14 pages, 2 figures, 4 tables
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
Recent advances in robot imitation learning have produced powerful visuomotor policies that manipulate diverse objects from visual inputs. However, monocular observations lack depth information, which is critical for precise manipulation in cluttered or geometrically complex scenes. Explicit depth maps and point clouds are often noisy and fragile in real-world manipulation. We introduce StereoPolicy, a visuomotor policy learning framework that directly leverages synchronized stereo image pairs to improve geometric reasoning without constructing explicit 3D representations. StereoPolicy processes each image with pretrained 2D vision encoders and fuses left-right features through a cross-attention-based Stereo Transformer, capturing spatial correspondence and disparity cues implicitly. The framework integrates with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks and seven real-robot tabletop and bimanual mobile manipulation tasks. Our results show that stereo vision bridges 2D pretrained representations and 3D geometric understanding for robotic manipulation.
Safety-Critical Adaptive Impedance Control via Nonsmooth Control Barrier Functions under State and Input Constraints
Safe physical interaction is critical for deploying robotic manipulators in human-robot interaction and contact-rich tasks, where uncertainty, external forces, and actuator limitations can compromise both performance and safety. We propose an online adaptive impedance control framework that enforces joint-state safety while achieving compliant interaction under uncertain dynamics. The approach combines a quadratic-program-based safety filter with a novel composed position-velocity non-smooth control barrier function (NCBF), enabling joint position and velocity constraints to be enforced through a unified relative-degree-one barrier. Unknown dynamics are compensated online using an interval type-2 fuzzy logic system, while actuator torque limits are handled through soft constraints with exact penalty recovery of feasible solutions. A disturbance-observer-enhanced safety mechanism improves robustness against modelling errors and external interaction forces. Using composite Lyapunov analysis, we prove forward invariance of the safe set and the uniform ultimately boundedness of the impedance-tracking error. Simulations on a 7-DOF manipulator with severe parametric uncertainty and external interaction wrenches demonstrate safe constraint satisfaction and robust impedance tracking.
comment: 12 pages, 3 figures
AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation
Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human designed reward functions and repeated manual fine tuning, which is time consuming and does not guarantee high success in the desired task. This paper presents AgenticRL, agent guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real world deployment for unmanned aerial vehicles (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained tansformer (GPT) agent to interpret task information and visual scene observations, generate task specific reward functions, train policies using Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed loop self improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real world images and natural language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed loop refinement process improves policy behavior compared with initial rewards by 71%. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real world success rate of 91% and a sim-to-real accuracy of 94%.
EVE: A Generator-Verifier System for Generative Policies
Visuomotor policies based on generative such as diffusion and flow-matching have shown strong performance for robotics applications but degrade under distribution shifts, demonstrating limited recovery capabilities without costly finetuning. In the language modeling domain, test-time compute scaling has revolutionized the reasoning capabilities of modern LLMs by enabling candidate solution refinement. These methods typically leverage foundation models as verification modules in a zero-shot manner to score candidate solutions. We hypothesize that generative policies can similarly benefit from additional inference-time compute that employs zero-shot VLM-based verifiers in a generation-verification framework. To this end, we introduce EVE: a modular, generator-verifier interaction framework that boosts the performance of pretrained generative policies at test time, with no additional training. EVE wraps a frozen base policy with multiple zero-shot, VLM-based verifier agents. Each verifier proposes action refinements to the base policy candidate actions, while an action incorporator uses classifier guidance to fuse aggregated verifier feedback into action denoising. We study design choices for generator-verifier information interfacing across a system of verifiers with distinct capabilities. Across diverse simulated and real robotic tasks and embodiments, EVE consistently improves success rates without additional policy or verifier training. Through extensive ablations, we isolate the contribution of verifier capabilities and action incorporator strategies, offering practical guidelines to build scalable, modular generator-verifier systems for embodied control.
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
Video is a scalable observation of physical dynamics: it captures how objects move, how contact unfolds, and how scenes evolve under interaction -- all without requiring robot action labels. Yet translating this temporal structure into reliable robotic control remains an open challenge, because video lacks action supervision and differs from robot experience in embodiment, viewpoint, and physical constraints. This survey reviews methods that exploit non-action-annotated temporal video to learn control interfaces for robotic manipulation. We introduce an interface-centric taxonomy organized by where the video-to-control interface is constructed and what control properties it enables, identifying three families: direct video-action policies, which keep the interface implicit; latent-action methods, which route temporal structure through a compact learned intermediate; and explicit visual interfaces, which predict interpretable targets for downstream control. For each family, we analyze control-integration properties -- how the loop is closed, what can be verified before execution, and where failures enter. A cross-family synthesis reveals that the most pressing open challenges center on the robotics integration layer -- the mechanisms that connect video-derived predictions to dependable robot behavior -- and we outline research directions toward closing this gap.
How Users Understand Robot Foundation Model Performance through Task Success Rates and Beyond
Robot Foundation Models (RFMs) represent a promising approach to developing general-purpose home robots. Given the broad capabilities of RFMs, users will inevitably ask an RFM-based robot to perform tasks that the RFM was not trained or evaluated on. In these cases, it is crucial that users understand the risks associated with attempting novel tasks due to the relatively high cost of failure. Furthermore, an informed user who understands an RFM's capabilities will know what situations and tasks the robot can handle. In this paper, we study how non-roboticists interpret performance information from RFM evaluations. These evaluations typically report task success rate (TSR) as the primary performance metric. While TSR is intuitive to experts, it is necessary to validate whether novices also use this information as intended. Toward this end, we conducted a study in which users saw real evaluation data, including TSR, failure case descriptions, and videos from multiple published RFM research projects. The results highlight that non-experts not only use TSR in a manner consistent with expert expectations but also highly value other information types, such as failure cases that are not often reported in RFM evaluations. Furthermore, we find that users want access to both real data from previous evaluations of the RFM and estimates from the robot about how well it will do on a novel task.
Too Much of a Good Thing: When sim2real Efforts Impede Policy Learning (And What to Do About It)
While sim2real efforts are necessary for effective policy transfer to hardware, there is such a thing as too much of a good thing. We argue that sim2real efforts have led to misaligned incentives with policy learning, resulting in simulator lock in and poor policy exploration due to the unreasonable constraints imposed by the real world. We offer a diagnosis and explanation of the current status of the problem, and propose a potential solution via a sim2sim2real paradigm that leverages the robot's kinematics as the sole design constraint.
Sem-NaVAE: Semantically-Guided Outdoor Mapless Navigation via Generative Trajectory Priors
This work presents a mapless navigation approach for outdoor applications. It combines the exploratory capacity of conditional variational autoencoders (CVAEs) to generate trajectories and the semantic segmentation capabilities of a lightweight visual language model (VLM) to select the trajectory to execute. Open-vocabulary segmentation is used to score and select the generated trajectories based on natural language, and a state-of-the-art local planner executes velocity commands. One of the key features of the proposed approach is its ability to generate a large variability of trajectories and select them to navigate in real-time. In real-world outdoor experiments, Sem-NaVAE achieves a 90% success rate across routes of 120-240m in unseen environments, outperforming the nearest baseline by 10% while remaining within 7% of a map-based upper bound. A video showing an experimental run of the system can be found in https://youtu.be/i3R5ey5O2yk.
comment: Accepted for publication in IEEE Robotics and Automation Letters (RA-L). 8 pages, 5 figures
Right Model, Right Time: Real-Time Cascaded-Fidelity MPC for Bipedal Walking ICRA 2026
This paper presents a multi-phase whole-body model predictive control (MPC) approach for bipedal walking, combining a detailed whole-body model in the near horizon with a simplified single-rigid-body model in the later prediction steps. This reduces computational complexity while retaining prediction capabilities. The resulting nonlinear optimal control problem is solved entirely within the general-purpose, off-the-shelf nonlinear MPC framework acados, using sequential quadratic programming (SQP). Given a contact schedule and a target walking speed, the controller optimizes joint torques without depending on preselected footstep locations. The controller is validated in MuJoCo simulation on the 18-DoF bipedal robot HyPer-2.
comment: Presented at IEEE ICRA 2026 Workshop "2cnd Workshop on Frontiers of Optimization for Robotics"
A Reproducible and Physically Feasible Dynamic Parameter Identification Framework for a Low-Cost Robot Arm
This paper presents a reproducible and physically feasible dynamic parameter identification framework for CRANE-X7, a low-cost robot arm driven by modular smart actuators. To improve practical identifiability, products of inertia are removed according to approximate link symmetry, reducing the rigid-body model from 65 to 39 base parameters. Identification motions are hand-designed from structured single-joint and adjacent-joint primitives under practical joint-range limits. The proposed pipeline combines preprocessing, inverse-dynamics-regressor-based ordinary least squares (OLS), conditional semidefinite-programming (SDP) projection for feasibility recovery, and closed-loop input error (CLIE) refinement. Candidate solutions from 40 structured trajectories are analyzed in a common principal component analysis (PCA) space to select a statistically central representative model. Because statistical centrality alone does not ensure physical acceptability, the selected model is finally screened by an all-pose positive-definiteness audit of the inertia matrix and, when necessary, corrected by a localized post-CLIE SDP rescue step. Experiments show that the parameter cloud becomes progressively more concentrated from OLS to SDP and CLIE, while the final accepted model preserves high predictive accuracy on held-out validation motions. These results demonstrate a practical route to statistically coherent and physically feasible dynamic models for low-cost robot platforms.
comment: 11 pages, 8 figures, 7 tables, 1 algorithm and 2 appendices
Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey
Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.
Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring
Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by the difficulty of controlling vehicles under highly uncertain and non-stationary underwater dynamics. To address these challenges, we employ a data-driven reinforcement learning approach to compensate for unknown dynamics and task variations. Traditional single-task reinforcement learning has a tendency to overfit the training environment, thus, limit the long-term usefulness of the learnt policy. Hence, we propose to use a contextual multi-task reinforcement learning paradigm instead, allowing us to learn controllers that can be reused for various tasks, e.g., detecting oysters in one reef and detecting corals in another. We evaluate whether contextual multi-task reinforcement learning can efficiently learn robust and generalisable control policies for autonomous underwater reef monitoring. We train a single context-dependent policy that is able to solve multiple related monitoring tasks in a simulated reef environment in HoloOcean. In our experiments, we empirically evaluate the contextual policies regarding sample-efficiency, zero-shot generalisation to unseen tasks, and robustness to varying water currents. By utilising multi-task reinforcement learning, we aim to improve the training effectiveness, as well as the reusability of learnt policies to take a step towards more sustainable procedures in autonomous reef monitoring.
comment: To be published in IEEE OCEANS 2026 (Sanya) conference proceedings
Vectorized Online POMDP Planning ICRA 2026
Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving numerical computations to consist of only estimation of expectations. VOPP represents all data structures related to planning as a collection of tensors, and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel online solver with no dependencies or synchronization bottlenecks between concurrent processes. Experimental results indicate that VOPP is at least $20\times$ more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver. Moreover, VOPP outperforms state-of-the-art sequential online solvers, while using a planning budget that is $1000\times$ smaller.
comment: 8 pages, 3 figures. Accepted at ICRA 2026
Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation
Embodied chain-of-thought (CoT) aims to bridge linguistic reasoning and robotic control, but its effective form and integration strategy remain underexplored. In this paper, we revisit embodied CoT for vision-language-action (VLA) models at large scale. We construct the largest embodied CoT corpus to date, comprising 978,743 trajectories, 226.3M samples, and 2592.5 hours of robot data. Through extensive experiments, we find that effective embodied CoT should ground high-level semantic understanding into concrete action guidance, such as end-effector movement descriptions and image-space trajectories, while high-level reasoning alone brings only marginal gains. We further show that explicit CoT does not scale reliably when used as an autoregressive action prefix, as it suffers from compounding inference errors and unstable reasoning-action coupling. To address these limitations, we propose ERVLA, a VLA model that uses embodied CoT as representation-shaping supervision rather than mandatory test-time reasoning. ERVLA is trained with a reasoning-dropout strategy, enabling the model to absorb rich reasoning traces during training while predicting actions directly without CoT decoding during inference. This design improves scalability with increasing pre-training data and avoids autoregressive instability. ERVLA achieves state-of-the-art performance on LIBERO-Plus with an 86.9% success rate and reaches 53.2% success rate on VLABench, demonstrating strong out-of-distribution generalization. In real-robot experiments, ERVLA further outperforms competitive state-of-the-art baselines, especially on tasks requiring semantic disambiguation and long-horizon execution.
Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3--5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.
comment: No
A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature
Embodied agents that navigate cities rely on world models that predict how their surroundings will change as they move. But for navigation, what matters is not what the buildings look like; it is where the agent can go. Most world models nonetheless predict appearance, learning how a scene looks rather than the space an agent can move through. Those that do target geometry, such as bird's-eye-view occupancy grids, flatten the three-dimensional environment onto a ground plane, discarding the above-ground and multi-level structure that shapes real navigation. What is missing is a predictive target that captures the navigable geometry an agent actually traverses, without photometric entanglement and without collapsing the third dimension. Our key idea is to model the open volume between buildings, the negative space, encoded as a 3D isovist: a spherical visibility-depth map recording the distance to the nearest surface in every direction. We introduce an embodied world model that predicts the next isovist from a short history of past isovists and a movement action. The prediction is formulated as a depth residual so the decoder inherits sharp building edges, trained with self-rollout scheduled sampling to keep corrupted context on the geometry manifold, and equipped with a persistent latent bird's-eye-view spatial map for cross-path consistency. Our central finding is emergent and unexpected: a single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature, with city identity linearly decodable from its temporal latents far above single-frame baselines, so the signature lives in the learned dynamics rather than in appearance. The representation is lightweight, interpretable, and reproducible, offering a geometric substrate for spatial reasoning in embodied AI, robotics, and urban analysis, released with an open dataset and pipeline.
DVGT: Driving Visual Geometry Transformer
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.
comment: Code is available at https://github.com/wzzheng/DVGT
PerchRL: Vision-Based Agile Perching on Inclined Platforms under Rapid and Irregular Motion
Autonomous vision-based perching of quadrotors on moving inclined platforms is critical for air-ground collaboration but remains challenging due to the limited field of view (FOV). In this paper, we propose PerchRL, a reinforcement learning (RL) framework for vision-based agile perching on inclined platforms under rapid and irregular motion. Specifically, we employ a two-stage learning strategy consisting of state-based pre-training followed by vision-based fine-tuning. To improve generalization across diverse platform motions, we employ randomized platform trajectories to prevent overfitting and temporal augmentation methods to capture latent motion patterns from historical observations. During vision-based fine-tuning, a hybrid learning framework consisting of visibility-aware state augmentation and active perception rewards is presented to improve robustness under intermittent visual loss. Extensive simulation and real-world experiments demonstrate the feasibility, stability, and real-time performance of PerchRL, while successful deployment across distinct quadrotor platforms further validates its adaptability. The source code will be released to benefit the community.
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
Unlike chatbots, physical AI must act while the world keeps evolving. Therefore, the inter-chunk pause of synchronous executors are fatal for dynamic tasks regardless of how fast the inference is. Asynchronous execution -- thinking while acting -- is therefore a structural requirement, and real-time chunking (RTC) makes it viable by recasting chunk transitions as inpainting: freezing committed actions and consistently generating the remainder. However, RTC with flow-matching policy is structurally suboptimal: its inpainting comes from inference-time corrections rather than the base policy, yielding little pre-training benefit, specific fine-tuning, heuristic guidance, and extra computation that inflates the latency. In this work, we observe that discrete diffusion policies, which generate actions by iteratively unmasking, are natural asynchronous executors that resolve all limitations at once: they are fine-tuning free since inpainting is their native operation, while early stopping further provides adaptive guidance and reduces inference cost. We propose DiscreteRTC, which replaces external corrections with native unmasking, and show on dynamic simulated benchmarks and real-world dynamic manipulation tasks that it achieves higher success rates than continuous RTC and other baselines. In summary, DiscreteRTC is simpler to implement with 0 lines of additional code to enable async inpainting, faster at inference with only ~0.7 computation compared with generating actions from scratch, and better at execution with 65% higher success rate in real-world hockey defend task compared with flow-matching RTC, and 30% higher compared with training-time flow-matching RTC. More visualizations are on https://outsider86.github.io/DiscreteRTCSite/.
ZeroWBC: Learning Natural Whole-Body Humanoid Interaction from Human Egocentric Data
Achieving versatile and natural whole-body humanoid interaction control remains challenging due to the high cost of whole-body teleoperation data. We present ZeroWBC, a teleoperation-free framework that learns humanoid whole-body interaction from human egocentric videos paired with synchronized whole-body motion and text annotations. ZeroWBC adopts a generation-then-tracking formulation to tackle the static scene whole-body interaction control problem. Given an initial egocentric image and a language instruction, a fine-tuned Vision-Language Model generates future human whole-body motion tokens, which are decoded into continuous motions and retargeted to the humanoid. The resulting reference motions, together with root and key body-part trajectories, are then executed by a general interactive motion tracking policy. To improve interaction performance, we introduce an interaction-oriented tracking reward that prioritizes global root and key body-part trajectory alignment while preserving natural whole-body motion. Experiments on the Unitree G1 humanoid robot show that ZeroWBC enables diverse scene-aware behaviors without robot teleoperation demonstrations. These results suggest a scalable paradigm for learning natural humanoid whole-body interaction from human egocentric data.
LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion
Recent robot foundation models largely rely on large-scale behavior cloning, which imitates expert actions but discards transferable dynamics knowledge embedded in heterogeneous embodied data. While the Unified World Model (UWM) formulation has the potential to leverage such diverse data, existing instantiations struggle to scale to foundation-level due to coarse data usage and fragmented datasets. We introduce LDA-1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality. To support this regime at scale, we assemble and standardize EI-30k, an embodied interaction dataset comprising over 30k hours of human and robot trajectories in a unified format. Scalable dynamics learning over such heterogeneous data is enabled by prediction in a structured DINO latent space, which avoids redundant pixel-space appearance modeling. Complementing this representation, LDA-1B employs a multi-modal diffusion transformer to handle asynchronous vision and action streams, enabling stable training at the 1B-parameter scale. Experiments in simulation and the real world show LDA-1B outperforms prior methods (e.g., $π_{0.5}$) by up to 21\%, 48\%, and 23\% on contact-rich, dexterous, and long-horizon tasks, respectively. Notably, LDA-1B enables data-efficient fine-tuning, gaining 10\% by leveraging 30\% low-quality trajectories typically harmful and discarded.
comment: Accepted at RSS 2026, Project Page:https://pku-epic.github.io/LDA
PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models
Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forgetting of previously learned behaviors. While experience replay (ER) serves as a standard mitigating strategy, naive uniform sampling fundamentally misaligns with the temporal characteristics of manipulation trajectories. It systematically under-samples brief but causally critical sub-skills, leading to phase starvation, and completely overlooks the varying degrees of forgetting across historical tasks. To overcome these limitations, we introduce PHASER, an architecture-agnostic continual learning framework. PHASER employs a phase-centric capacity allocation to guarantee equal memory support for all sub-skills, coupled with a multi-modal interference routing strategy that dynamically prioritizes historical phases at high risk of forgetting. Furthermore, to enable fully autonomous lifelong adaptation, we integrate Auto-PC, a lightweight pipeline combining unsupervised action-signal change-point detection with VLM-based semantic verification to extract temporal boundaries without intensive manual supervision. Evaluated across three VLA backbones on LIBERO continual learning suites, PHASER yields substantial empirical improvements, increasing Average Success Rate (ASR) by up to 31% over matched-budget ER and achieving an 87.8% final ASR on the LIBERO-Goal CL setting.
comment: 20 pages, 8 figures, 12 tables
Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation
Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an under-specified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate through repeated high-information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost-sensitive uncertainty-reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information-gain analysis on existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data-derived weights. However, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost-sensitive interaction. Based on this taxonomy, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero-shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.
DEFLECT: Temporal Counterfactual Preference Learning for Delay-Robust Asynchronous VLAs
Vision-Language-Action (VLA) policies increasingly rely on asynchronous inference to hide large-model latency behind ongoing robot motion. While this avoids the stop-and-go behavior of synchronous action-chunk execution, it creates a prediction-execution mismatch: the next chunk is computed from a stale observation at inference start but executed only after the robot and scene have evolved. As a result, actions that fit the prediction-time state can become misaligned with the execution-time state. Existing runtime repair, behavior-cloning, and preference-alignment approaches do not directly teach the policy to resolve this stale-input mismatch. We propose DEFLECT, an offline post-training framework for delay-robust asynchronous VLAs. DEFLECT converts latency-induced mismatch into counterfactual preference supervision: a frozen reference VLA generates a preferred chunk from the future execution-time observation and a rejected chunk from the stale prediction-time observation. The trainable policy scores both chunks under the same deployment-time input, learning to favor execution-time-aligned actions while a supervised fine-tuning anchor preserves the expert action manifold. DEFLECT requires no human preference labels, reward models, online robot rollouts, architectural changes, or additional inference-time computation. Across Kinetix, LIBERO, and three real-robot tasks, DEFLECT improves delay robustness over strong asynchronous VLA baselines, raising high-latency success by up to 6.4 percentage points and achieving a 4.6 percentage-point gain at the longest delay on a real-scale VLA.
Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents
Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallelization; unfortunately, these can sometimes still require large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.
BiPneu: Design and Control of a Bipolar-Pressure Pneumatic System for Soft Robots
Positive-negative pressure regulation is critical to soft robotic actuators, enabling large motion ranges and versatile actuation modes. However, achieving high-performance regulation across both pressure polarities remains challenging due to asymmetric inflation-deflation dynamics, valve nonlinearities, and switching-induced flow disturbances. This paper presents BiPneu, a scalable and cost-efficient multi-channel bipolar-pressure pneumatic system for soft robots that enables wide-range, accurate, and responsive pressure regulation while providing seamless compatibility with high-level software ecosystems. A dual-mode sliding-mode controller (DM-SMC) with hysteresis-supervised mode selection is proposed based on a hybrid electro-pneumatic model. Extensive simulation and experiments demonstrate the superior performance of DM-SMC in tracking step and sinusoidal pressure references compared with both advanced model predictive controllers and well-tuned PID controllers. Experimental results show average absolute errors of 1.44 kPa in multi-step tests and 4.23 kPa in sinusoidal tracking, corresponding to reductions of 11.9% and 35.6% relative to PID control, along with improved control effort, valve switching rate, and transient response. Robustness of DM-SMC is further verified on a bellow actuator with pressure-dependent volume. Finally, BiPneu's capability is demonstrated via two soft robotic examples, quick ball-maneuvering with a soft parallel manipulator and real-time finite element method (FEM)-based teleoperation of a soft bellows actuator.
comment: Full Version of BiPenu, including the supplementary materials
Dynamic Policy Learning for Legged Robot with Simplified Model Pretraining and Model-Homotopy-Inspired Transfer
Generating dynamic motions for legged robots remains a challenging problem. While reinforcement learning has achieved notable success in various legged locomotion tasks, producing highly dynamic behaviors often requires extensive reward tuning or high-quality demonstrations. Leveraging reduced-order models can help mitigate these challenges. However, the model discrepancy poses a significant challenge when transferring policies to full-body dynamics environments. In this work, we introduce a continuation-based learning framework that combines simplified model pretraining and model-homotopy-inspired transfer to efficiently generate and refine complex dynamic behaviors. First, we pretrain the policy using a single rigid body model to capture core motion patterns in a simplified environment. Next, we employ a continuation strategy to progressively transfer the policy to the full-body environment, minimizing performance loss. To define the continuation path, we introduce a parametric transition path from the single rigid body model to the full-body model by gradually redistributing mass and inertia between the trunk and legs. The proposed method achieves faster convergence and demonstrates superior stability during the transfer process compared to baseline methods. Our framework is validated on a range of dynamic tasks, including flips and wall-assisted maneuvers, and is successfully deployed on a real quadrupedal robot.
comment: 8 pages
3PoinTr: 3D Point Tracks for Learning Manipulation from Unconstrained Human Videos
Learning manipulation policies from human videos could greatly reduce the need for expensive robot demonstrations, but existing approaches typically require restrictive assumptions such as choreographed human motions, predefined keypoints, manual annotations, or known grasp locations. We propose 3PoinTr, a method for pretraining sample-efficient robot policies from unconstrained human videos by predicting dense 3D point tracks. In the unconstrained human demonstration videos, humans are free to follow whatever trajectories and manipulation strategies they see fit, rather than choreographing their motions to mimic a robot. 3PoinTr uses a lightweight visibility-aware transformer to learn how scene points should move from human videos, and then trains a closed-loop multitask robot policy to flexibly extract action-relevant priors from those predicted point tracks. With only 20 action-labeled robot demonstrations, 3PoinTr achieves a 25.0 percentage point higher average success rate than the strongest behavior cloning and video-pretraining baselines on real-world tasks, and a 29.6 percentage point higher average success rate in simulation. Targeted ablations support the key design choices and confirm the benefit of learning from actionless videos. We further show that 3PoinTr's point track prediction transformer outperforms a strong baseline by preserving supervision over partially occluded points. Project page: https://adamhung60.github.io/3PoinTr/.
Continuum Robot State Estimation with Actuation Uncertainty
Continuum robots are flexible, slender manipulators well suited for confined surgical environments. In these settings, unknown interaction forces and model uncertainty significantly affect robot shape, motivating state estimation from external observations. Existing estimation methods either neglect actuation modeling or rely on simplified deterministic actuation models. In contrast, we jointly estimate robot shape, external loads, and actuation inputs using mechanically principled actuation priors. To achieve this, we present a discrete Cosserat rod formulation with piecewise-linear strain integration that provides high numerical accuracy while inducing a sparse factor graph structure for efficient nonlinear optimization. We extend the framework to tendon-driven and parallel robots in simulation and validate it experimentally on a surgical concentric tube robot. Overall, our approach enables principled real-time estimation across multiple robot architectures while providing direct access to manipulator Jacobians through the linearized factor graph.
comment: Public preprint for IEEE RAL. Accepted May 2026
Multiagent Systems
SHIELDS: Automating OS Hardening with Iterative Multi-Agent Remediation
Security misconfigurations remain a leading cause of OS-level compromise, and manually keeping systems compliant with standards like Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs) is a tedious and expensive process. Existing compliance automation tools can reduce some of this burden, but they depend on static, pre-written corrective actions. In this paper, we introduce SHIELDS, a multi-agent system that uses large language models (LLMs) to approach OS hardening as an iterative, feedback-driven process. Instead of applying fixed remediations, SHIELDS continuously proposes fixes and refines them based on feedback from target system execution and validation scans. We evaluate the system across multiple virtual machine configurations using six contemporary LLMs ranging from 20B to 400B parameters, and find that SHIELDS successfully remediates up to 73% of scan findings. Our results also suggest that success in this setting depends less on model size (parameter count) than on effective tool use and information gathering, paving a practical path toward reducing the burden of security compliance in environments where compute is limited or security and privacy needs drive local model use.
Ahoy: LLMs Enacting Multiagent Interaction Protocols
An interaction protocol formalizes how the agents in a multiagent system interact, which facilitates implementing agents. Existing approaches yield agent implementations specific to the selected protocols. How can we engineer intelligent agents that can enact protocols but are programming-free? Our contribution, Ahoy, addresses this question by creating LLM agents that dynamically select and enact declarative protocols to achieve user goals. We demonstrate that an \ahoy agent can correctly and intelligently enact multiple protocols - concurrently if appropriate to the user goal - without specialized training. Ahoy's significance lies in that it brings together declarative protocols and LLMs, both approaches that promise improved knowledge engineering for agents.
comment: Presented at EMAS 2026
Streaming Communication in Multi-Agent Reasoning
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
comment: project page: https://zhenyangcs.github.io/StreamMA-website/
Provably Auditable and Safe LLM Agents from Human-Authored Ontologies
We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append-only ledger. We present two production-grade appropriate domains, in healthcare billing compliance, and security vulnerability disclosure. Working code for Agentic Redux run on both domains is available in a supporting code repository. We also introduce Ontology-First Agent Design, a methodology for creation of agent frameworks on a problem domain, in which a human expert ontologizes the problem domain with Basic Formal Ontology, and then assigns an LLM to derive roles that agents and humans-in-the-loop can fill, in order to work the problems in the domain.
R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search
Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over extended horizons, fluency does not ensure reliable delivery. We trace this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never invalidated. We argue these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. We introduce Reflective Adversarial Pareto Search (R-APS), to our knowledge the first method addressing all three failures jointly via reasoning-mode decomposition, allocating each reasoning mode its own context and orchestrating interaction across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). R-APS requires no fine-tuning and operates on a frozen LLM purely via structured protocol design. We evaluate on planar mechanism synthesis (robotics, prosthetics, mechanical design), with every candidate checked by a kinematic solver. On 32 target trajectories, R-APS delivers robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar-count and worst-case robustness. Small 4B reasoning-specialized models prove competitive with general-purpose 70B backbones inside the protocol, suggesting structured protocols can partially offset model scale.
RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation
RAMPART is a compile-time memory model and pure in-RAM block registry for LLM-based agents. Context assembly is a programmable runtime operation where content is compiled from a structured registry under explicit policy for ordering, inclusion, and eviction. Five composable primitives (promote, gate, write, evict, rollback) act on named addressable blocks before compilation at zero prompt-token cost. Provenance tags and non-evictable authorship flags implement a permissioned memory model with block-level ownership. Controlled probes with Qwen3-8B Q4 show that compile-time placement and the structural relationship between blocks and the task query affect task success, with the cliff falling at roughly the seventh block position when the task follows the registry and the twelfth when it precedes. Grouping the critical block with content-adjacent neighbours and promoting the group as a unit lifts task success by tens of percentage points at positions where single-block placement fails. Cross-model replication on Qwen2.5-7B, Llama-3.1-8B, Mistral-7B-v0.3, and Qwen3-14B shows the content-priming effect appears at the same absolute positions across families, with magnitude varying with model strength. Block grouping raises Mistral's mean pass rate roughly fivefold at the hardest registry size, and a smaller model with the intervention can outperform a larger model without it in the mid-registry zone. Relevance gating reduces prompt cost by 67.8\% while recovering 83% of the promoted-condition success rate. Schema eviction produces 0% invocations against 100% with the schema present, a property policy-based approaches cannot guarantee by construction. Shared-registry coordination reduces inter-agent communication to a method call at zero coordination token cost.
AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning
We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLM as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.
comment: Technical report, 27 pages
When Freshness Is Not Enough: Distribution-Aware Age of Information for Networked LQR Control
Age of Information (AoI) has become a central metric for the design of wireless update systems, especially in applications where fresh measurements support tracking, estimation, and control. Despite its popularity, the use of mean AoI or peak AoI as a surrogate for closed-loop performance is often motivated by intuition rather than by a control-theoretic derivation. This paper examines whether minimizing the mean AoI is in fact optimal for networked control systems. For scalar linear time-invariant systems with delayed intermittent updates, we show that, under state-independent scheduling policies, the infinite-horizon LQR tracking problem reduces to an optimization over the distribution of inter-scheduling intervals. The resulting objective depends on higher-order statistical moments, and in unstable or correlated regimes on exponential moments, of the inter-scheduling process rather than only on its mean. Consequently, policies with identical mean AoI can induce substantially different tracking costs. We further extend the analysis to disturbances with exponentially decaying autocorrelation and derive equivalent cost formulations that expose the role of the full interval distribution. Finally, we validate the theory using real vehicle trajectories from the NGSIM US-101 dataset. The empirical results match the predicted performance trends, demonstrating that mean AoI alone is insufficient for control-oriented network design.
Organizational Control Layer: Governance Infrastructure at the Execution Boundary of LLM Agent Systems
LLM-based agents are increasingly deployed in workflows where generated outputs may directly trigger state-changing actions. This creates an execution-boundary problem: proposed actions must be governed before they are executed. We study this problem through economically consequential multi-agent interactions and argue that deployment-grade agent systems should separate proposal generation from environment-facing execution. To operationalize this principle, we introduce the Organizational Control Layer (OCL), a model-agnostic governance infrastructure that intercepts generated actions before execution through policy enforcement and escalation, without modifying the underlying LLM generator. We evaluate OCL on adversarial buyer--seller negotiation environments adapted from AgenticPay. Across multiple frontier LLM backends, OCL reduces unsafe executions from 88% to near-zero while increasing valid success from 12% to 96%. Results further reveal a safety--utility tradeoff: strict governance improves compliance and reliability against policy and constraint violations, but can reduce flexibility in tightly constrained markets. These findings suggest that deployment-grade LLM agent systems require explicit governance at the boundary between language generation and executable actions. The source code is available at: https://github.com/SHITIANYU-hue/amai_ocl
comment: 13 pages, 2 figures
Failure Modes of Deep Multi-Agent RL in Asynchronous Pricing: Reproducible Triggers, Trace Diagnostics, and a Partial Fix
We study two reproducible failure modes of deep multi-agent reinforcement learning in continuous-time pricing markets: (i) tacit cartel formation between competing DDPG agents, and (ii) actor--critic instability at high event rates. We instantiate both inside a single CT-MARL benchmark (Poisson-clocked price updates, observation latency $δ$, interior-optimum logit demand), show that synchronous DDPG agents reliably trigger Failure Mode 1 with collusion index $Δ= 0.69 \pm 0.11$, and quantify a partial microstructure fix: asynchrony alone cuts collusion by 48\% and adding latency drives it to a minimum of $Δ= 0.28$. The fix has clearly documented costs: it is partial ($Δ$ remains supra-Bertrand), it is non-monotone in $δ$, and it does not survive Failure Mode 2, which emerges as DDPG critic divergence at $λ= 5$ and corrupts the phase-diagram cell at $(λ{=}5, δ{=}1)$. We accompany the scalar collusion index with trajectory-level trace diagnostics that expose the within-episode signalling collapse and the post-shock non-recovery.
CUCo: An Agentic Framework for Compute and Communication Co-design
Computation and communication in distributed LLM training and inference are traditionally optimized in isolation; expert-crafted systems such as DeepEP, FLUX, and TokenWeave show the potential of co-design but require deep systems expertise and hardware-specific tuning; CUCo is an agentic framework that automates compute-communication co-design of CUDA kernels by combining a structured design-space formalization with a correctness-first fast-path agent for reliable baselines and an evolution-driven slow-path agent for high-performance strategies, achieving up to 1.57x speedup across four multi-GPU workloads and discovering a two-stream overlap strategy on a DeepSeek-V3 MoE layer that hides dispatch behind local compute at an LLM inference cost under $10 per workload.
DPBench: Structural Determinants of Multi-Agent LLM Coordination Under Simultaneous Resource Contention
We present DPBench, a benchmark for evaluating coordination in multi-agent systems built from large language models. Existing benchmarks measure task-level success under a fixed protocol; the structural conditions under which coordination succeeds or fails at all have not been characterised. DPBench adapts the Dining Philosophers problem into a controlled testbed where the action protocol, the communication structure, and the group size each vary independently. We evaluate six agents: GPT-5.2, Claude Opus 4.5, Grok 4.1, Gemini 2.5 Flash, Llama 4 Maverick, and a uniform-random baseline. Under simultaneous action at N=5 with the default prompt, deadlock ranges from 25.0% (95% Wilson CI [11.2, 46.9]) for GPT-5.2 to 90.0% [74.4, 96.5] for Gemini 2.5 Flash; sequential action is solved by four of the six. Holding the model fixed at Gemini 2.5 Flash, three protocol variables drive deadlock from 90% to within CI of zero: three rounds of pre-commitment communication (0.0% vs. single-round 86.7%), a prompt encoding a classical concurrency primitive (0.0% for resource-ordering and symmetry-breaking, against 100% for the minimal prompt), or doubling the group from N=5 to N=10 (90.0% to 10.0%). Single-round messaging and memory of past timesteps do not change the rate at the sample size we ran. Whether the same model coordinates or deadlocks is determined by the protocol, not by the model's capability.
comment: 20 pages, 4 figures
A Reliable Self-Organized Distributed Complex Network for Communication of Smart Agents
Collaboration among distributed agents is fundamental to many complex systems, particularly in communication networks where connectivity must be maintained under energy constraints. In this study, we utilize intelligent agents (nodes) trained through reinforcement learning techniques to establish connections with their neighbors, ultimately leading to the emergence of a large-scale communication cluster. Notably, there is no centralized administrator; instead, agents must adjust their connections based on information obtained from local observations. The connection strategy is formulated using a physical Hamiltonian, thereby categorizing this intelligent system under the paradigm of "Physics-Guided Machine Learning". Agents are trained via a Deep Q-Network using local observations to minimize changes in the Hamiltonian, enabling adaptive decision-making in dynamic environments. Simulation results demonstrate that the proposed collaborative strategy forms robust large-scale communication clusters while reducing transmission energy compared to baseline approaches. The network maintains high connectivity under agent mobility, density variations, node failures, and environmental obstacles, highlighting strong adaptability and resilience. These findings indicate that physics-guided reinforcement learning provides an effective mechanism for distributed topology optimization in emerging IoT and vehicular communication networks.
When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges ACL 2026
Customizing an LLM judge to a specific problem or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion, however they produce natural-language critiques, not numerical vectors. Thus, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) does not apply to this multi-objective textual gradient setting. We extend TextGrad to the multi-objective setting and test four decomposition modes of textual gradient optimizers by varying how much cross-objective information the loss, gradient and optimizer LLMs share. We find the gradient's task-focus drops by 59% (9.0 to 3.7 out of 10) when the gradient LLM must provide feedback on multiple criteria jointly. Separately, we observe that naively combining single-objective optimized instructions into a single prompt degrades Spearman rho from 0.305 to 0.220 (-0.085). These results identify two separable failure modes: optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge optimization using textual feedback.
comment: Accepted at ACL 2026 - CustomNLP4U Workshop. Code, prompts and data available at https://github.com/adivekar-utexas/when-gradients-collide
Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach
The emergence of large language model agents capable of invoking external tools has created urgent need for formal verification of agent protocols. Two paradigms dominate this space: Schema-Guided Dialogue (SGD), a research framework for zero-shot API generalization, and the Model Context Protocol (MCP), an industry standard for agent-tool integration. While both enable dynamic service discovery through schema descriptions, their formal relationship remains unexplored. Building on prior work establishing the conceptual convergence of these paradigms, we present the first process calculus formalization of SGD and MCP, proving they are structurally bisimilar under a well-defined mapping Phi. However, we demonstrate that the reverse mapping Phi^{-1} is partial and lossy, revealing critical gaps in MCP's expressivity. Through bidirectional analysis, we identify five principles -- semantic completeness, explicit action boundaries, failure mode documentation, progressive disclosure compatibility, and inter-tool relationship declaration -- as necessary and sufficient conditions for full behavioral equivalence. We formalize these principles as type-system extensions MCP+, proving MCP+ is isomorphic to SGD. Our work provides the first formal foundation for verified agent systems and establishes schema quality as a provable safety property.
comment: Logical flaw in Theorem 21
Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems
Prior research has enhanced the ability of Large Language Models (LLMs) to solve logic puzzles using techniques such as chain-of-thought prompting or introducing a symbolic representation. These frameworks are still usually insufficient to solve complicated logical problems, such as Zebra puzzles, due to the inherent complexity of translating natural language clues into logical statements. We introduce a multi-agent system, ZPS, that integrates LLMs with an off the shelf theorem prover. This system tackles the complex puzzle-solving task by breaking down the problem into smaller, manageable parts, generating SMT (Satisfiability Modulo Theories) code to solve them with a theorem prover, and using feedback between the agents to repeatedly improve their answers. We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions and show that the automated grader is reliable by evaluating it in a user-study. Our approach shows improvement in all three LLMs we tested, with GPT-4 showing 166% improvement in the number of fully correct solutions.
ToolRosella: Translating Code Repositories into Standardized Tools for Scientific Agents
Large Language Model (LLM)-based agent systems are increasingly used for scientific tasks, yet their practical capability remains constrained by the narrow scope of manually curated tools they can invoke. Much scientific computational functionality already exists in open-source code repositories, but these resources remain difficult to standardize, operationalize, and invoke reliably for agent use. Here we present ToolRosella, a framework that automatically transforms heterogeneous scientific code repositories into standardized, agent-invocable tools. ToolRosella combines repository analysis, tool interface construction, execution testing, and iterative repair to address the problem of repository-to-tool standardization. Across 122 GitHub repositories spanning 35 subdisciplines in six domains, ToolRosella reaches a 61.5\% repository conversion success rate after iterative repair, with a 4.4 speedup over human engineers. The resulting 1,580 callable tools support a downstream task success rate of 84.0\% and improve performance when integrated into other agent frameworks, particularly on tasks whose required tools are absent from fixed, curated inventories.
comment: 20 pages
Shift Bribery over Social Networks
In shift bribery, a briber seeks to promote his preferred candidate by paying voters to raise their ranking. Classical models of shift bribery assume voters act independently, overlooking the role of social influence. However, in reality, individuals are social beings and are often represented as part of a social network, where bribed voters may influence their neighbors, thereby amplifying the effect of persuasion. We study Shift bribery over Networks, where voters are modeled as nodes in a directed weighted graph, and arcs represent social influence between them. In this setting, bribery is not confined to directly targeted voters its effects can propagate through the network, influencing neighbors and amplifying persuasion. Given a budget and individual cost functions for shifting each voter's preference toward a designated candidate, the goal is to determine whether a shift strategy exists within budget that ensures the preferred candidate wins after both direct and network-propagated influence takes effect. We show that the problem is NP-Complete even with two candidates and unit costs, and W[2]-hard when parameterized by budget or maximum degree. On the positive side, we design polynomial-time algorithms for complete graphs under plurality and majority rules and path graphs for uniform edge weights, linear-time algorithms for transitive tournaments for two candidates, linear cost functions and uniform arc weights, and pseudo-polynomial algorithms for cluster graphs. We further prove the existence of fixed-parameter tractable algorithms with treewidth as parameter for two candidates, linear cost functions and uniform arc weights and pseudo-FPT with cluster vertex deletion number for two candidates and uniform arc weights. Together, these results give a detailed complexity landscape for shift bribery in social networks.
Expected Return Symmetries ICLR 2025
Symmetry is an important inductive bias that can improve model robustness and generalization across many deep learning domains. In multi-agent settings, a priori known symmetries have been shown to address a fundamental coordination failure mode known as mutually incompatible symmetry breaking; e.g. in a game where two independent agents can choose to move "left'' or "right'', and where a reward of +1 or -1 is received when the agents choose the same action or different actions, respectively. However, the efficient and automatic discovery of environment symmetries, in particular for decentralized partially observable Markov decision processes, remains an open problem. Furthermore, environmental symmetry breaking constitutes only one type of coordination failure, which motivates the search for a more accessible and broader symmetry class. In this paper, we introduce such a broader group of previously unexplored symmetries, which we call expected return symmetries, which contains environment symmetries as a subgroup. We show that agents trained to be compatible under the group of expected return symmetries achieve better zero-shot coordination results than those using environment symmetries. As an additional benefit, our method makes minimal a priori assumptions about the structure of their environment and does not require access to ground truth symmetries.
comment: Published at ICLR 2025
A Simple Hierarchical Causality Primer
We provide a brief primer for the idea behind formalising hierarchical causality in the context of complex systems. Here actors are not simply agents. Actors instantiate causation classes. Agents implement local dynamics in given levels or organisation in a given system. Hierarchical causality then describes how actor-level roles constrain, select, and organise agent-level behaviour across levels. The system then necessarily requires three additional structures. First, causation classes to abstract a given form of causal influence that an actor instantiates. Second, aggregation operators to move across the levels. Third, discrete event-time maps are required because the system comprises events, and the relation between local event counts and any global clock must be specified. Our formulation here is purposefully simple and discrete.
comment: 8 pages, 1 figure; short technical primer with a toy example in an appendix, corrected minor typos, refined the admissible kernel notation
Multi-Agent Temporal Logic Planning via Penalty Functions and Block-Coordinate Optimization
Multi-agent planning under Signal Temporal Logic (STL) is often hindered by collaborative tasks that lead to computational challenges due to the inherent high dimensionality of the problem, preventing scalable synthesis with satisfaction guarantees. To address this, we formulate STL planning as an optimization program under multi-agent STL constraints and introduce a penalty-based unconstrained relaxation that can be efficiently solved via a Block-Coordinate Gradient Descent (BCGD) method, where each block corresponds to a single agent's decision variables, thereby mitigating complexity. By utilizing a quadratic penalty function defined via smooth STL semantics, we show that BCGD iterations converge to a stationary point of the penalized problem under standard regularity assumptions. To enforce feasibility, the BCGD solver is embedded within a two-layer optimization scheme: inner BCGD updates are performed for a fixed penalty parameter, which is then increased in an outer loop to progressively improve multi-agent STL robustness. The proposed framework enables scalable computations and is validated through various complex multi-robot planning scenarios.
OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection
OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes visible when Agents move from isolated applications into open, multi-operator networks: before an Agent can safely discover, select, and invoke another Agent, it needs a way to verify identity provenance, governance state, discovery authorization, freshness, and pre-connection trust evidence. OAN is designed as a protocol-neutral trust layer. It does not replace Agent interaction protocols, tool protocols, model orchestration frameworks, or application-level workflows. Instead, it provides Root-governed identity admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, and signed trusted invocation. This paper presents the motivation, architecture, roles, governance model, relationship with MCP, A2A, and ANP, deployment patterns, cooperation model, blockchain-backed authorization bulletin, prototype status, performance profile, and roadmap of OAN.
OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery
This paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection. It specifies the role architecture, identity objects, registration workflow, Root-governed lifecycle, Root-verified package model, authorization-aware Discovery, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations. The design is intended to support heterogeneous Agent frameworks and interaction protocols, including MCP, A2A, ANP-like systems, and domain-specific Agent protocols. OAN does not define the entire business conversation among Agents; it defines how Agent identities become admissible, discoverable, verifiable, and safe to approach before protocol-specific interaction begins.
Systems and Control (EESS)
Electromagnetic Characterization of magnetic ring: Case of square cross section shape
This paper presents a comprehensive 2D analytical model of a toroidal magnetic ring with a square cross-section, subjected to sinusoidal excitation. By applying Maxwell's equations in local Cartesian coordinates and utilizing a complex permeability framework, the exact analytical expressions for the internal magnetic field, flux, complex impedance, and losses are derived. The model rigorously separates eddy current losses, hysteresis losses, and winding losses, explicitly accounting for the skin effect and complex permeability within the conductive core using separation of variables and hyperbolic functions. Furthermore, parameter for apparent permeability is expressed to map the core behavior onto simplified linear material models. The derivations establish a mathematical foundation highly suitable for standardized material characterizations, such as Brockhaus and Iwatsu ring measurements, by avoiding the heavy computational cost of 2D and 3D Finite Element Analysis.
CAPE: Control Algorithm Performance Evaluation under Learned Vehicle Dynamics Models
We propose the Control Algorithm Performance Evaluation (CAPE) framework, a systematic methodology for benchmarking racing controllers under our proposed learned enhanced physics model (EPM). The proposed framework enables cross-controller comparison by evaluating five closed-loop control architectures. We further compare our proposed EPM with two state-of-the-art learned vehicle dynamics models: Deep Pacejka Model (DPM) and Deep-learning Dynamics Model (DDM). Closed-loop experiments show that across all models and controllers, the proposed EPM achieves best average lap times. Specifically, the Adaptive NMPC with EPM achieves a time of 5.82 s, compared with 12.99 s for DPM and 8.80 s for DDM, while simultaneously producing substantially lower longitudinal and lateral tracking errors under identical controller configurations. We further evaluate all three models and five controllers using a disturbance-aware simulation framework incorporating measurement noise, process disturbances, actuator delay, and parametric uncertainty. Under moderate global disturbance scaling factor (η = 1), results averaged across the five controllers show that EPM reduces a) longitudinal tracking error by 29.0% and 17.2%; b) lateral tracking error by 24.6% and 12.3%; c) while increasing average velocity magnitude by 39.9% and 3.1% relative to DPM and DDM, respectively. Overall, CAPE establishes a systematic benchmark for evaluating the performance of learned vehicle dynamics models in a closed-loop control framework and demonstrates that our proposed EPM significantly improves controller robustness and performance under realistic uncertainties.
comment: 12 pages
CRESS: Quantifying Vulnerabilities of Attack Scenarios in Hardware Reverse Engineering
The safety, security, and reliability of microelectronic systems depend on a trustworthy, secured supply chain and design flow. Globally distributed supply chains or unintentional design weaknesses leave the door open for attacks on the hardware level. These scenarios encompass counterfeiting, hardware trojans, or on-device attacks. For these, hardware reverse engineering (RE) results play a pivotal role. The ongoing publication of new RE-involved attacks motivated the development of the common RE scoring system (CRESS). The system enables a general classification of RE-involved scenarios for a common, consistent rating. In this work, the originally qualitative system is extended to a quantitative system. We performed an extensive interview study with experts in the field. The interview results allowed us to derive weights that measure the severity of different RE-involved attack categories. The weights form an equation that quantifies scenarios, resulting in the severity-indicating CRESS score. The score enables the coherent rating of novel scenarios, renders them comparable, and supports the development of effective countermeasures. To showcase the effectiveness of the quantitative CRESS Score, six selected case studies are rated qualitatively and quantitatively. The CRESS Score proves to be significantly more expressive than the industry-standard Common Vulnerability Scoring System (CVSS).
Zero knowledge verification for frontier AI training is possible
Frontier AI governance frameworks increasingly use cumulative training compute as the primary criterion for designating high-impact models, but enforcement rests on self-reporting because no technical verification primitive for training exists. Any future international agreement on frontier AI faces the same problem at higher stakes: coordinated regulation of technologies with significant externalities has historically rested on technical verification, without which agreements are declaratory. Recent governance analyses judge zero-knowledge proofs a promising candidate but currently impractical at frontier scale [26, 4]. We argue the impracticality is paradigm-bound rather than fundamental, and propose a verification architecture for frontier dense pre-training combining a pre-committed training specification, inter-node network observations, and on-the-fly Merkle commitments of intermediate computation, verified through a zero-knowledge Virtual Machine (zkVM) with native BF16/FP32 precompiles. The proof checks the actual floating-point computation the GPU performed rather than a fixed-point approximation, and preserves model-architecture confidentiality through a private training specification. The protocol produces three proof types: a genesis proof at initialisation, in-training step proofs across the run, and ex-ante attestations enforcing policy-relevant claims as running invariants, turning the training record into a governance-enforceable artefact. We estimate a deployable proof of concept within approximately 36 months at single-digit-percent training-side overhead, against a six-to-ten-year cycle for verification-grade custom silicon. Thirteen open research and engineering problems are catalogued as a research agenda for external contribution
comment: 44 pages, 2 figures
Characterization and Analysis of Emergency Landing Flight Envelopes with Graded Safety Specifications
Emergency landing flight envelope analysis traditionally adopts a binary notion of safety, whereby a trajectory is safe only if state constraints are satisfied pointwise in time. In practice, ensuring a successful landing requires recognizing that aircraft operation spans a continuum in the state space from the nominal to the critical regime. Between these regimes lies a degraded regime of states outside nominal operation that may be visited only for limited durations. Safety is therefore inherently graded, in the sense that limited exposure to degraded states may be tolerated, and must be assessed using a trajectory-dependent criterion rather than a purely pointwise-in-time one. This paper develops a Hamilton-Jacobi reachability framework for analyzing emergency landing flight envelopes under this graded notion of safety. Safety is encoded through a soft constraint defined by a designer-specified continuous violation cost function that assigns zero cost in the nominal regime and larger cost to more safety-critical off-nominal states. We introduce a general class of state- and time-dependent violation cost functions and establish monotonicity and continuity properties that characterize how the flight envelope varies with the cost of off-nominal operation. These results provide a principled sensitivity analysis linking safety conservativeness to operational capability. Building on this analysis, we propose a synthesis algorithm for parameterized violation cost functions in this class. The algorithm provably converges to the least conservative parameter under which a prescribed off-nominal safety requirement is satisfied. Numerical results for a fixed-wing emergency landing scenario under propulsion failure demonstrate the sensitivity properties and validate the algorithm.
Dual Lyapunov-based Synchronization Control of Rössler System SC
This paper proposes a novel approach for the synchronization problem of nonlinear dynamical systems, integrating dual Lyapunov stability analysis with polynomial optimization. A comprehensive review of the relevant scientific literature on synchronization methods is conducted, with a particular focus on classical Lyapunov-based methods for chaotic systems. In this study, the Rössler system is synchronized by employing dual Lyapunov-based closed-loop synchronization method. This method uses semidefinite programming and sum-of-squares polynomials to compute a nonlinear state feedback function which synchronize a chaotic system to a selected reference model. It is aimed that chaotic behavior is destroyed and, instead, a limit cycle becomes attracting. Simulation works are performed for randomly selected 100 different initial conditions to show that synchronization process is successfully performed. Furthermore, bifurcation diagrams and phase portraits are evaluated to analyze the system dynamics. The paper discusses results and how new constraints should be employed and adapted to more complex systems.
comment: Presented at the International Interdisciplinary Chaos Symposium on Chaos and Complex Systems (SCCS 2025), Istanbul, Türkiye. A version of this work has been accepted for publication in the conference proceedings and will appear in Chaos and Complex Systems: Proceedings of the 6th International Interdisciplinary Chaos Symposium (Springer Cham)
Peer-to-Peer Cloud Service Market for Data Centers Oriented to Computation-Electricity Coordination
Energy-intensive data centers (DCs) have emerged as substantial and flexible loads in modern power systems, underscoring the critical need for computation-electricity coordination. Harnessing the spatio-temporal flexibility of DC workloads is a promising approach to facilitate this coordination. However, existing studies overlook the collaborative potential of computational resource sharing among geo-distributed DCs, thereby failing to fully unlock this flexibility. In this paper, a bi-level computation-electricity coordination framework is proposed to explicitly capture the bidirectional interactions between DCs and power grid. Firstly, a peer-to-peer cloud service market (P2P-CSM) for geo-distributed DCs is proposed, which enables bilateral cloud service transactions to leverage regional heterogeneities (e.g., electricity prices, cooling efficiency). Secondly, locational marginal prices are embedded into the framework to reflect network congestion and nodal price disparities. Thirdly, a dual consensus alternating direction method of multipliers (ADMM)-based decentralized algorithm is developed as the P2P market clearing algorithm, and a bisection-assisted iterative algorithm is proposed to ensure rigorous convergence of the framework. Case studies conducted on modified IEEE 30-bus system validate that the P2P-CSM achieves a win-win computation-electricity coordination: it not only increases total DC operational profit by 22.8\%, but also effectively alleviates grid congestion and yields a 3.2\% reduction in total energy consumption.
A Survey of Smart Grid Emerging Use Cases and Relevant 5G and 6G Capabilities and Features
The growing complexity of modern energy systems has led to the adoption of Smart Grid (SG) that use advanced communication technologies to facilitate efficient, reliable, secure, and sustainable energy operation and management. Unlike existing surveys that often treat grid and communication domains separately, this work rigorously quantifies service requirements for high-complexity emerging scenarios. It provides a comprehensive overview of SG architecture that integrates digital communication infrastructure with distributed energy resources (DERs), microgrids, energy storage systems, and cybersecurity frameworks. Furthermore, emerging SG use cases such as smart distributed voltage control, real-time fault detection and self-healing, smart and autonomous monitoring, and predictive maintenance are identified, and more importantly, service performance requirements associated with these use cases have been quantified. Additionally, key capabilities and emerging SG enablers of fifth-generation (5G) and sixth-generation (6G) networks are described. These capabilities and enablers include network slicing, edge computing, spectrum management, artificial intelligence (AI) driven optimization, digital twins, and Open-Radio Access Network (O-RAN). Finally, the paper discusses open challenges and future research directions for designing scalable, intelligent, and secure next-generation SG systems.
Consistent Distributed Cooperative Localization for Ultra Large-Scale Multi-agent Systems
Cooperative localization (CL) is fundamental in emerging multi-agent systems, where agents fuse local sensing data with exchanged information to estimate their own states. At a large scale, however, tracking cross-correlations becomes infeasible, preventing the use of optimal filters. Ignoring or underestimating these correlations leads to overconfident, and thus inconsistent, estimates. Existing CL algorithms achieve good performance and consistency typically at the expense of communication, computation, or memory that scales with the network size. This is incompatible with ultra large-scale systems (ULSS) - for example, satellite mega-constellations - where per-agent resources are limited and must remain independent of the number of agents. This reveals a critical gap: no existing CL method is simultaneously well-performing, consistent, and ULSS-scalable. This paper introduces a new CL framework that addresses this gap using the recently proposed overlapping covariance intersection methodology, which enables agents to exploit limited structural information about cross-correlations without compromising consistency. The resulting CL algorithm leads to optimal conservative covariance propagation using only locally available information. The method is fully distributed, scalable to an ultra large scale, and provably recursively consistent. Simulations demonstrate substantial performance improvement over state-of-the-art consistent CL approaches while preserving scalability.
Source Side Mitigation of AI Datacenter Power Fluctuations with a Hybrid Energy Storage System and Residual Differentiable Predictive Control
The rapid growth of hyperscale AI datacenters introduces structured, workload-driven active-power fluctuations at the point of interconnection. These fluctuations appear to the grid as time-varying disturbance injections that cannot be captured by conventional peak- or average-load representations. To reduce the residual power disturbance before it propagates into the bulk power system, this paper proposes a hybrid energy storage system with differentiable predictive control (HESS-DPC) framework for datacenter-side power smoothing. A workload-driven disturbance model is first established, representing the point-of-interconnection load deviation as the superposition of training and fine-tuning workloads to capture the structured forcing inputs that can excite generator frequency dynamics. A frequency-based rule-based controller then allocates this deviation between a battery energy storage system (BESS) and a supercapacitor (SC), assigning the energy-dominant component to the BESS and the fast-varying component to the SC. To overcome the anticipation and constraint limitations of fixed-frequency decomposition, a residual differentiable predictive control policy is trained offline to compute finite-horizon command corrections around the rule-based baseline while enforcing a one-step safeguard. Simulations on the NPCC 140-bus system show that HESS-DPC reduces grid-side residual deviations during workload transitions, improves SC state-of-charge sustainability over extended operation, and reduces generator peak-to-peak frequency deviations by more than 80 percent across all monitored generators, with the worst-affected generator response falling from 15.1 mHz to 1.3 mHz. These results confirm that local active-power smoothing at the datacenter point of interconnection can substantially mitigate frequency disturbances caused by AI workloads.
A model-free approach to control barrier functions for higher-order systems
Control barrier functions (CBFs) are a widely applied modular tool to ensure safe operation of nonlinear dynamical control systems. However, for their construction accurate knowledge of the system dynamics is typically needed. This requirement was recently alleviated for relative-degree-one systems using techniques from prescribed performance control (PPC) or funnel control (FC). This article extends the model-free CBF design to nonlinear systems of arbitrary relative degree. Moreover, we show with a simple example that a straightforward extension of existing results for relative-degree-one systems fails. Instead, we utilize novel techniques from funnel control to characterize a subset of the controls satisfying a CBF condition without requiring a dynamic model or state measurement. Finally, we demonstrate the applicability of our results on a seven degrees of freedom robotic manipulator with relative degree two.
Towards Guaranteed Optimal PID Tuning for Uncertain Nonlinear Systems
Despite the widespread use of PID controllers in engineering practice, designing optimal PID parameters has long been regarded as a challenging problem in both theory and practice, particularly when faced with uncertain nonlinear dynamical systems. Based on the authors' PID control theory established recently for MIMO nonlinear uncertain systems (Zhao and Guo, 2022), which provides a concrete PID parameter set for global stability of PID controlled systems, this paper further proposes a near-optimal PID tuning method, where only input-output (zeroth-order) data on the control performance is available. The tuning method is formulated as a constrained optimization problem and solved by an iterative learning algorithm, referred to as HRS-KW algorithm, that combines a hysteretic random search with the Kiefer-Wolfowitz algorithm, aiming at utilizing the advantages of both global exploration and local gradient acceleration. This method operates without requiring precise structural knowledge of the system dynamics, yet its almost sure convergence to an epsilon-optimal solution for the PID parameters can be guaranteed in theory while ensuring closed-loop system stability. Simulation results illustrate that our HRS-KW algorithm outperforms other related optimization methods, exhibiting better convergence to the prescribed epsilon-optimal performance set.
comment: Accepted by IFAC World Congress 2026
Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control
Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.
GPU-Accelerated Direct Transcription-Based Nonlinear Model Predictive Control
In this paper, we present a GPU-accelerated framework for nonlinear model predictive control (NMPC) based on direct transcription and second-order interior-point methods. Many real-world systems exhibit nonlinear dynamics that cannot be accurately captured by linear models, motivating the use of NMPC. However, NMPC requires the repeated real-time solution of optimal control problems (OCP), which become computationally demanding large-scale nonlinear programs (NLPs) after transcription. Although GPU acceleration has emerged as a promising approach for nonlinear optimization, existing GPU-based NMPC workflows reconstruct structurally identical OCPs at each solve. This introduces substantial overhead even though successive solves differ only through updated system measurements or reference trajectories. To address this limitation, we introduce a parametric interior-point formulation that exploits the fixed structure of transcribed OCPs, enabling reuse of structure-dependent computations (e.g., symbolic factorization in sparse Cholesky) across re-solves. We evaluate the proposed framework on distillation column and 2D heated plate benchmarks against state-of-the-art CPU and GPU configurations. The results show that the framework achieves over an order-of-magnitude speedup in total NMPC run times. These improvements are primarily driven by reduced per-iteration solve times, with GPU execution achieving up to a 94% reduction compared to the baseline. Overall, the results demonstrate the effectiveness of exploiting repeated problem structure in GPU-accelerated NMPC and highlight the potential of the proposed framework to expand the envelope of real-time NMPC applications.
Mixed potential for nonlinear RLC circuits with memristors
In two seminal articles published in 1964, Brayton and Moser introduced the concept of a mixed potential as a fundamental theoretic tool to describe and analyze a class RLC of nonlinear circuits containing resistors, capacitors and inductors. In this paper, it is shown for the first time that a mixed potential can be introduced for a class RLCM of RLC circuits containing also memristors. This is possible provided a memristor circuit is analyzed not in the traditional voltage-current domain but rather in the flux-charge domain. The flux-charge analysis method (FCAM) plays a crucial role in the extension, in particular, a key step is an equivalence principle established via FCAM between an RLCM circuit in the flux-charge domain and a nonlinear RLC circuit in the voltage-current domain. Several examples are discussed where the mixed potential is explicitly found. These include basic circuits with memristors, such as Chua's circuit with a memristor and also large-scale memristor arrays with a neural architecture. This paper is mainly devoted to the introduction of a mixed potential for memristor circuits and the study of its main theoretic properties, as the possibility to write the circuit state equations in the flux-charge domain in an effective and compact form via the mixed potential. In a companion paper [1], the mixed potential is used to obtain in a systematic way Lyapunov-like results on convergence of RLCM circuits. Those results will extend existing results on convergence that do not cover the important case where there is the simultaneous presence of capacitors and inductors in a memristor circuit.
Implementation of a Misalignment-Tolerant MIMO Near Field Wireless Power Transfer System
The efficiency of reactive near-field wireless power transfer (WPT) systems degrades rapidly with increasing separation distance and is highly sensitive to misalignment between transmitting and receiving coils. These limitations restrict the mobility of powered devices and confine many near-field WPT applications to static scenarios. To address these challenges, a multiple-input multiple-output (MIMO) WPT configuration is investigated due to its capability to shape the magnetic field distribution between the transmitter and receiver. Maximum power transfer efficiency can be achieved by appropriately setting the amplitude and phase of each transmitting coil; however, determining these optimal settings requires accurate knowledge of the system's S-parameters. This paper presents the use of the Nelder-Mead iterative optimization algorithm to estimate the input amplitude and phase settings that maximize transfer efficiency in a near-field WPT system. The implementation comprises a four-element transmitter and a two-element receiver. Based on measured S-parameters, the proposed approach significantly improves WPT efficiency under both aligned and misaligned conditions.
comment: 5 pages, 8 figures, submitted and accepted to the 2026 IEEE Wireless Power Transfer Conference and Expo (WPTCE)
Small-Signal Analyses Using Analytical IBR Models and Frequency-Dependent Thévenin Equivalents
This paper investigates whether component-level studies can capture additional interactions through Small Signal Analysis (SSA) when the network connected to the Voltage Source Converter (VSC), typically modeled as a simple Thevenin Equivalent, is a more complex IBR-based network. The research investigates cases ranging from basic analytical to an IEEE 9-Bus EMT model, with and without Inverter-Based Resources (IBRs), synthesized as State-Space elements. The study identified that spurious poles at 50Hz related to dq-frame conversion can hinder the accuracy of participation factor analysis. A potential approach involves a two-step process: first, applying Henkel reduction to remove most spurious poles, followed by manual elimination of any remaining ones.
comment: Accepted at 5th International Conference on Power Systems and Electrical Technology, Osaka
Bearing Only Distributed Circumnavigation with Limited Target Information for Asymmetric Dubins Vehicles
In this paper, we present a class of bearing based distributive nonlinear guidance laws for the cooperative circumnavigation of a stationary target by a heterogeneous team of asymmetric Dubins vehicles. In such a vehicle, the maximal left and right turn capabilities are non uniform. In the given framework, the location of the target is known only to a small subset of the vehicles, called the leaders. The uninformed vehicles, called the followers, use information from their out neighbours in the communication graph, constructed using the nearest neighbour rule. A class of guidance laws is formulated that relies solely on the heading angle and line of sight angles of a designated out neighbour of the vehicle in the graph. Using Zubov theorem, we prove that the proposed guidance laws achieve global asymptotic stability under angular speed only control and ensure the convergence of the trajectories of all the Dubins vehicles to a common centre. The proposed results are validated through numerical simulations.
comment: 8 pages, 13 figures
Self-Optimizing Control of Continuous Processes Based on Reinforcement Learning
This paper addresses the Self-Optimizing Control (SOC) problem in industrial continuous processes and proposes a Reinforcement-Learning (RL)-based SOC approach to improve dynamic performance under high-frequency disturbances. In the proposed framework, the SOC controlled variable structure is embedded in the Actor network, and reward functions are designed based on economic indicators. Through interaction with the environment, the RL agent optimizes controlled variables while implicitly considering implementability and steady-state uniqueness. Online fine-tuning is further introduced to alleviate model mismatch. Experiments on a continuous stirred-tank reactor with disturbances compare the proposed RL-based SOC method with the Objective-Guided Controlled Variable Learning Approach based on steady-state data. The results show that the RL method achieves improved dynamic performance under real-time disturbances, generates smooth controlled variable outputs without explicit regularization, reduces hyperparameter-tuning complexity, and enhances adaptability through online adjustment. Overall, the proposed RL-based SOC approach provides an effective solution for nonlinear process control and offers a promising reference for future studies involving multiple disturbances, multiple operating conditions, and model-free scenarios.
Input-to-State Stable Bundle Koopman Neural ODEs for Learning Controlled Dynamics under Environmental Constraints
We propose ISS-BKNO, a unified framework that integrates Koopman operator identification, Neural ordinary differential equations (ODEs), fiber bundle geometry, and input-to-state stability (ISS) certification. Unlike prior approaches that address stability, extrinsic inputs, or environmental constraints in isolation, the proposed framework simultaneously learns controlled nonlinear dynamics while guaranteeing global convergence and a computable ISS gain. The architecture introduces a three-stage lifting pipeline: a bundle-aware encoder that separates environment-specific fibers, an environment-conditioned Koopman backbone whose matrix spectrum is constrained to lie in the left half-plane, and a residual neural ODE correction whose Jacobian satisfies a quadratic sector bound. Lyapunov-based ISS regularization turns the stability requirement into a differentiable penalty that is jointly optimized with the prediction objective. Theoretical results establish fiber invariance, ISS with an explicit gain formula, and an approximation error bound that scales with the EDMD residual. Experiments on a pendulum, cart-pole, a unicycle-based navigation task, and a Franka Emika manipulator demonstrate substantially improved prediction accuracy and robustness under matched disturbances compared with existing Neural ODE and Koopman baselines.
When Freshness Is Not Enough: Distribution-Aware Age of Information for Networked LQR Control
Age of Information (AoI) has become a central metric for the design of wireless update systems, especially in applications where fresh measurements support tracking, estimation, and control. Despite its popularity, the use of mean AoI or peak AoI as a surrogate for closed-loop performance is often motivated by intuition rather than by a control-theoretic derivation. This paper examines whether minimizing the mean AoI is in fact optimal for networked control systems. For scalar linear time-invariant systems with delayed intermittent updates, we show that, under state-independent scheduling policies, the infinite-horizon LQR tracking problem reduces to an optimization over the distribution of inter-scheduling intervals. The resulting objective depends on higher-order statistical moments, and in unstable or correlated regimes on exponential moments, of the inter-scheduling process rather than only on its mean. Consequently, policies with identical mean AoI can induce substantially different tracking costs. We further extend the analysis to disturbances with exponentially decaying autocorrelation and derive equivalent cost formulations that expose the role of the full interval distribution. Finally, we validate the theory using real vehicle trajectories from the NGSIM US-101 dataset. The empirical results match the predicted performance trends, demonstrating that mean AoI alone is insufficient for control-oriented network design.
Safe and Energy-Aware Multi-Robot Density Control via PDE-Constrained Optimization for Long-Duration Autonomy
This paper presents a novel density control framework for multi-robot systems with spatial safety and energy sustainability guarantees. Stochastic robot motion is encoded through the Fokker-Planck Partial Differential Equation (PDE) at the density level. Control Lyapunov and control barrier functions are integrated with PDEs to enforce target density tracking, obstacle region avoidance, and energy sufficiency over multiple charging cycles. The resulting quadratic program enables fast in-the-loop implementation that adjusts commands in real-time. Multi-robot experiment and extensive simulations were conducted to demonstrate the effectiveness of the controller under localization and motion uncertainties.
Safety-Critical Adaptive Impedance Control via Nonsmooth Control Barrier Functions under State and Input Constraints
Safe physical interaction is critical for deploying robotic manipulators in human-robot interaction and contact-rich tasks, where uncertainty, external forces, and actuator limitations can compromise both performance and safety. We propose an online adaptive impedance control framework that enforces joint-state safety while achieving compliant interaction under uncertain dynamics. The approach combines a quadratic-program-based safety filter with a novel composed position-velocity non-smooth control barrier function (NCBF), enabling joint position and velocity constraints to be enforced through a unified relative-degree-one barrier. Unknown dynamics are compensated online using an interval type-2 fuzzy logic system, while actuator torque limits are handled through soft constraints with exact penalty recovery of feasible solutions. A disturbance-observer-enhanced safety mechanism improves robustness against modelling errors and external interaction forces. Using composite Lyapunov analysis, we prove forward invariance of the safe set and the uniform ultimately boundedness of the impedance-tracking error. Simulations on a 7-DOF manipulator with severe parametric uncertainty and external interaction wrenches demonstrate safe constraint satisfaction and robust impedance tracking.
comment: 12 pages, 3 figures
Advanced AI Service Provisioning in O-RAN through LLM Engine Integration
The Open Radio Access Network (O-RAN) architecture allows AI to be embedded directly into the RAN through modular xApps and rApps, yet creating these applications collecting data, training models, writing code, and deploying them safely remains slow and largely manual. Large Language Models (LLMs) offer strong reasoning and code-generation capabilities but are unsuited for the fast, deterministic inference required in real-time RAN control. We present a proof-of-concept Dual-Brain architecture that combines both strengths: an LLM-based orchestrator translates operator intents into data-collection policies and deployment code, while an automated ML engine, NeuralSmith, trains lightweight classifiers on demand via an API. We describe the architecture and provisioning workflow, share practical insights from a containerized O-RAN 5G~SA testbed, and discuss open research directions.
Designing Control Barrier Functions Using a Dynamic Backup Policy
This paper presents a systematic approach to construct control barrier functions for nonlinear control affine systems subject to arbitrary state and input constraints. Taking inspiration from the reference governor literature, the proposed method defines a family of backup policies, parametrized by the equilibrium manifold of the system. The control barrier function is defined on the augmented state-and-reference space: given a state-reference pair, the approach quantifies the distance to constraint violation at any time in the future. The proposed method is applied to an inverted pendulum on cart.
comment: 6 pages, 1 figure
Cooling Channel Design Optimization for High Power Multi-Chip Packages
Thermal management is a major challenge in next-generation high-performance computing systems, particularly for heterogeneous multi-chip packages such as the NVIDIA GB200 Grace Blackwell Superchip. In this work, a physics-based computational framework is developed to optimize embedded cooling channel layouts for high-power multi-chip modules. The model couples steady-state heat conduction with a porous media-based representation of coolant transport, coupled with a row-wise coolant energy balance, to estimate chip temperature fields within microchannel networks. Unlike conventional designs, an interdigitated cooling architecture is parameterized using geometric variables, including channel count, width, and expansion over chip regions, enabling systematic design exploration. To enable efficient optimization, a surrogate-based approach is employed to approximate the relationship between geometric parameters and temperature metrics. The resulting model is optimized using a mixed-integer quadratic programming algorithm to minimize a weighted objective based on peak and average chip temperatures. To improve physical relevance, channel placement is further constrained to increase cooling coverage near GPU regions, where thermal loads are highest. The framework is applied to a representative multi-chip configuration based on NVIDIA GB200 architecture, consisting of two graphics processing units and one central processing unit. The results demonstrate that the optimal design reduces the peak chip temperature by 140.45°C and the average chip temperature by 35.87°C compared to the baseline configuration.
comment: 9 pages, 8 figures
LiSeCo: Linear Semantic Control for Language Generation NeurIPS
The prevalence of Large Language Models (LLMs) in critical applications highlights the need for controlled language generation methods that are both computationally efficient and enjoy performance guarantees. To address this need, we use a common model of concept semantics as linearly represented in an LLM's latent space. In particular, we take the view that natural language generation traces a trajectory in this continuous semantic space, realized by the language model's hidden activations. This view permits a control-theoretic treatment of text generation in latent space, in which we propose Linear Semantic Control (LiSeCo), a lightweight, gradient-free intervention that dynamically steers trajectories away from regions corresponding to undesired meanings. In particular, we propose to directly intervene, in an online fashion, the activations of the token that is being generated in embedding space. Crucially, LiSeCo does not simply steer activations towards a desirable region. Instead, it relies on classical techniques from control theory to precisely control activations in a context-dependent way, and guarantees that they are brought into a specific pre-defined region of embedding space that corresponds to allowed semantics. The intervention is computed in closed form according to an optimal controller formulation, minimally impacting generation time. This control of the activations in embedding space allows for fine-grained steering of attributes of the generated sequence. We demonstrate that our approach is effective on different tasks -- toxicity, sentiment, and language (English/Spanish) steering -- while maintaining text quality.
comment: TMLR 2026 camera ready; earlier version in NeurIPS MINT Workshop 2024
Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey
Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.
Hashprice moderates the electricity demand response of Bitcoin miners
Large controllable loads, such as Bitcoin-mining facilities, are increasingly viewed as valuable sources of power-system flexibility, yet the conditions under which this flexibility is realized remain poorly understood. We examine this issue in the Texas power market, where large loads face both wholesale electricity prices and incentives created by coincident-peak-based transmission charges. We find that mining load declines as costs rise across both channels, and this response is moderated by hashprice, a measure of expected revenue for Bitcoin miners. When hashprice is higher, mining load is less responsive to electricity-sector costs. This pattern is consistent with aggregate mining load arising from heterogeneous devices operated around distinct breakeven points. The wholesale-price response illustrates this mechanism most clearly. Mining load remains largely online at low electricity prices but begins to decline once prices exceed an implied curtailment threshold, and higher hashprice shifts this threshold to higher wholesale prices. Bitcoin miners therefore respond to electricity-sector costs, but the available flexibility varies with revenue conditions in the crypto-financial sector. Treating such loads as stable demand-response resources may overstate their available flexibility.
comment: This manuscript has supplementary information in the accompanying PDF
Where to Put Safety? Control Barrier Function Placement in Networked Control Systems
Control barrier functions (CBFs) are widely used to enforce safety in autonomous systems, yet their placement within networked control architectures remains largely unexplored. In this work, we investigate where to enforce safety in a networked control system in which a remote model predictive controller (MPC) communicates with the plant over a delayed network. We compare two safety strategies: i) a local myopic CBF filter applied at the plant and ii) predictive CBF constraints embedded in the remote MPC. For both architectures, we derive state-dependent disturbance tolerance bounds and show that safety placement induces a fundamental trade-off: local CBFs provide higher disturbance tolerance due to access to fresh state measurements, whereas MPC-CBF enables improved performance through anticipatory behavior, but yields stricter admissible disturbance levels. Motivated by this insight, we propose a combined architecture that integrates predictive and local safety mechanisms. The theoretical findings are illustrated in simulations on a planar three-degree-of-freedom robot performing a collision-avoidance task.
comment: This work has been submitted to the IEEE L-CSS for possible publication. The most recent version is the revised version after the first round of reviews
Multi-Agent Temporal Logic Planning via Penalty Functions and Block-Coordinate Optimization
Multi-agent planning under Signal Temporal Logic (STL) is often hindered by collaborative tasks that lead to computational challenges due to the inherent high dimensionality of the problem, preventing scalable synthesis with satisfaction guarantees. To address this, we formulate STL planning as an optimization program under multi-agent STL constraints and introduce a penalty-based unconstrained relaxation that can be efficiently solved via a Block-Coordinate Gradient Descent (BCGD) method, where each block corresponds to a single agent's decision variables, thereby mitigating complexity. By utilizing a quadratic penalty function defined via smooth STL semantics, we show that BCGD iterations converge to a stationary point of the penalized problem under standard regularity assumptions. To enforce feasibility, the BCGD solver is embedded within a two-layer optimization scheme: inner BCGD updates are performed for a fixed penalty parameter, which is then increased in an outer loop to progressively improve multi-agent STL robustness. The proposed framework enables scalable computations and is validated through various complex multi-robot planning scenarios.
Training with Hard Constraints: Learning Neural Certificates and Controllers for SDEs
Due to their expressive power, neural networks (NNs) are promising templates for functional optimization problems, particularly for reach-avoid certificate generation for systems governed by stochastic differential equations (SDEs). However, ensuring hard-constraint satisfaction remains a major challenge. In this work, we propose two constraint-driven training frameworks with guarantees for supermartingale-based neural certificate construction and controller synthesis for SDEs. The first approach enforces certificate inequalities via domain discretization and a bound-based loss, guaranteeing global validity once the loss reaches zero. We show that this method also enables joint NN controller-certificate synthesis with hard guarantees. For high-dimensional systems where discretization becomes prohibitive, we introduce a partition-free, scenario-based training method that provides arbitrarily tight PAC guarantees for certificate constraint satisfaction. Benchmarks demonstrate scalability of the bound-based method up to 5D, outperforming the state of the art, and scalability of the scenario-based approach to at least 10D with high-confidence guarantees.
comment: Accepted to 3rd International Conference on Neuro-Symbolic Systems (NeuS 2026)
Enhancing Hallucination Detection through Noise Injection ICLR 2026
Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from multiple samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is suboptimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple, training-free approach based on perturbing an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate that our approach significantly improves inference-time hallucination detection over standard sampling across diverse datasets, model architectures, and uncertainty metrics.
comment: ICLR 2026 main conference paper
Robotics
Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation
Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.
RSC: Decentralized Rigid Formation Flocking for Large-Scale Swarms via Hybrid Predictive Control and Online Reconfiguration
Decentralized rigid formation flocking requires a swarm of autonomous agents to maintain a predetermined geometric configuration while moving, relying solely on local sensing and communication. However, existing decentralized control methods struggle to maintain strict inter-agent distance constraints in cluttered environments, often suffering from local minima deadlocks, high frequency control oscillations, or limited flexibility during obstacle navigation, resulting in low success rate. To address these limitations, we propose Rigid Swarm Control (RSC), a decentralized control framework for large-scale rigid formation flocking. To escape local minima via robust long-term planning while ensuring short-term safety, RSC integrates finite-horizon trajectory predictions with a reactive artificial potential field (APF) safety controller within a hybrid architecture. Furthermore, to accelerate formation reassembly after obstacle traversal without interrupting task execution, RSC introduces an online leader-follower reconfiguration mechanism based on stable role exchange. Extensive evaluations in challenging cluttered environments with 25 UAVs demonstrate that RSC reliably unifies rigid formation maintenance, obstacle avoidance, and target tracking. Under strict success criteria - collision-free operation with a maximum relative edge-length error below 10%, RSC achieves an 83% success rate, significantly outperforming existing heuristic and learning-based baselines that fall below 5%.
comment: 8 pages, 4 figures, two-column format
What Are We Actually Benchmarking in Robot Manipulation?
A robotics benchmark score measures success under one fixed evaluation setup, yet is routinely treated as evidence of general manipulation capability. We identify four failure modes, each of which weakens or invalidates a benchmark's role as a valid proxy for that capability: shortcut solvability, lack of statistical significance, creeping overfitting, and data-source dependence. We propose one diagnostic per failure mode. We audit LIBERO, CALVIN, SimplerEnv, RoboCasa, and RoboTwin 2.0 under these diagnostics. LIBERO and CALVIN fail multiple diagnostics. RoboCasa and RoboTwin 2.0 fail fewer, despite appearing far less often in recent progress claims. On LIBERO, a 0.09B probe with no language encoder scores at or near reported SOTA, and most reported gains are not provably statistically significant. On CALVIN, randomizing block poses within the training range drops performance for every tested policy. We release the four diagnostics with reference implementations for authors and reviewers to apply before treating a benchmark score as evidence of progress. Code and artifacts are available at https://ripl.github.io/manipulation_benchmark_audit/.
comment: 31 pages, 6 figures
PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification ICRA 2026
Simulation environments are useful for both robot policy learning and planning verification and validation. Traditionally, the process of creating a simulation was onerous. Creating a bespoke simulation environment for each individual environment that a robot would operate in was simply infeasible. In this work, we introduce PerceptTwin, a fully automatic pipeline that constructs interactive simulations directly from semantic scene representations produced by a robot's perception stack. PerceptTwin combines open-vocabulary object maps with 3D asset generation, affordance prediction, and commonsense condition checking. These interactive simulations can be used to validate and refine plans before they are executed on the robot hardware. Borrowing from the AI alignment literature, we also introduce an LLM judge that verifies plan correctness and alignment with human preferences. Experiments show that PerceptTwin feedback allows LLM planners to refine plans, enhance safety, and resist harmful black-box prompting attacks. In our suite of tasks, PerceptTwin improves plan success by an average of approximately 39% for GPT5, GPT5Mini, and GPT5Nano planners. Additionally, PerceptTwin also improves human plan verification by up to 18% on average for plans that fail due to unfilled skill preconditions. Our results demonstrate the potential of open-vocabulary scene simulation from robot perception as a foundation for safer, more reliable robot planning.
comment: Accepted at ICRA 2026 (Vienna); published on arxiv for archival purposes. See also https://percept-twin.github.io/
Towards Estimating Normal and Shear Interface Pressures in Prosthetic Sockets via Least Squares and Mechanics Modeling
Prosthetic socket fitting remains largely manual and iterative, and objective fit metrics are still limited. Part of the challenge is the lack of long-term real-life pressure data at the residual limb--socket interface. Traditional pressure sensors are prone to drift over time, and capture only normal pressures at sparse locations within the socket, missing a critical component for biomechanical analysis: shear. Although some sensors can report both normal and shear interface stresses, these components are often difficult to decouple because of measurement crosstalk. One potential path forward is to develop models that can augment available measurements. This work introduces a testbed to evaluate model performance under sparse pressure sensing using two complementary validation signals: (i) the global wrench (\ie, total forces and moments expressed in an orthonormal frame) transmitted through the socket, by an artificial residual-limb, and (ii) local interface loads (\ie, decoupled normal and shear pressure components in a right-hand-rule orthogonal frame that lives in each instrumented location) measured by sparse sensing clusters, each composed of four capacitance-sensing channels. Rather than presenting full-field pressure estimates, the focus is on an analysis sequence that quantifies how well candidate mechanical models explain both global and local measurements under controlled conditions. A quasi-static spring--mass contact model is evaluated, and its parameters are identified via a two-stage convex least-squares problem. Validation under static loading shows that estimating constant bias terms reduces steady offsets in the wrench channels and improves agreement with local measurements. A Pareto-front sensitivity analysis further illustrates how the trade-off between global and local objectives changes when bias terms are included.
DLO-Lab: Benchmarking Deformable Linear Object Manipulations with Differentiable Physics ICML 2026
We address the challenge of enabling robots to manipulate deformable linear objects (DLOs), such as ropes, cables, and rubber bands. Prior work has primarily focused on narrow, task-specific problems, often relying on real-world demonstrations or handcrafted heuristics. Such approaches, however, struggle to scale to the wide variety of materials and tasks encountered in practice, and collecting sufficiently diverse real-world data is often impractical. Additionally, existing simulation environments offer limited support for the broad spectrum of material behaviors necessary for generalizable DLO manipulation. To overcome these limitations, we introduce a differentiable simulator explicitly designed for versatile DLO manipulation. Our simulator models a wide range of material properties-including (in)extensibility, elasticity, bending plasticity, and complex interactions with other objects-providing a robust foundation for learning and evaluating manipulation skills. Building on this simulator, we propose a benchmark suite of representative tasks that highlight the unique challenges of DLO manipulation. The successful execution of these tasks is often hindered by the topological complexity and grasp sensitivity inherent to DLOs. Therefore, we introduce a specialized DLO agent that explicitly manages these challenges by proposing strategic grasping points and decomposing long-horizon tasks to maximize control authority. Finally, we evaluate various policy-learning algorithms using our framework, alongside sim-to-real transfer experiments, demonstrating our platform's potential to advance DLO manipulation.
comment: ICML 2026, the project page: https://dlo-lab-26.github.io/
Dual Advantage Fields ICML 2026
Offline goal-conditioned reinforcement learning requires both long-horizon reachability estimates and local action comparisons. Dual goal representations provide value fields that capture global goal reachability, but they do not directly specify which action should be preferred at a given state. We propose Dual Advantage Fields, a policy-extraction method that turns a bilinear dual value model into a local advantage signal. Under bilinear dual parameterization, the goal embedding is the gradient of the value field with respect to the state representation. DAF learns an action-effect model that predicts the discounted feature displacement induced by an action and scores actions by the alignment between this displacement and the goal direction. In the realizable case, this score equals the goal-conditioned Bellman advantage, yielding a standard local policy-improvement guarantee. On OGBench locomotion, manipulation, and puzzle tasks, DAF improves aggregate RLiable metrics and performs strongly in settings where locally correct actions differ from direct movement toward the final goal.
comment: Accepted by ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning
Distribution-Free Risk-Aware Planning and Control Under Uncertainty Using Conformal Spectral Risk Control
Safe navigation in dynamic and uncertain environments often relies on accurate estimation of, or assumptions about, the true underlying uncertainty. However, accurately characterizing the true uncertainty distribution is often difficult due to limited data or imperfect information. An incorrect understanding of the uncertainty and its associated risk may lead to dangerous decisions even under high levels of risk aversion. To address this issue, we propose a risk-aware model predictive control (RA-MPC) framework that incorporates prediction sets to guarantee risk control below a user-specified threshold without requiring assumptions about the underlying uncertainty distribution. To generate the prediction sets, we develop a distribution-free risk quantification framework that extends conformal risk control (CRC) to general spectral risk measures. We then show that incorporating the prediction sets into the MPC framework provides statistical safety guarantees in terms of spectral risk constraint satisfaction even under uncertainty misspecification. We validate the proposed framework in simulated vehicle obstacle avoidance scenarios, demonstrating improved safety and reduced solve time compared to a baseline RA-MPC framework.
comment: Submitted to IEEE Robotics and Automation Letters
Affordance2Action: Task-Conditioned Scene-level Affordance Grounding for Real-Time Manipulation
Task-conditioned manipulation requires grounding instructions to task-relevant functional parts rather than object categories. This setting is scene-dependent and often one-to-many in cluttered scenes: the same object may afford different interactions across tasks, while a single task may correspond to either one functional region or multiple valid functional regions, depending on the scene layout. Existing affordance datasets and benchmarks remain misaligned with this setting, as they typically focus on grasping or object-level affordances, rely on synthetic scenes, or assume a single instruction-region correspondence. We present Affordance2Action (A2A), a benchmark-centered learning framework for scene-level, task-conditioned part affordance grounding. At its core is A2A-Bench, a manipulation-oriented benchmark that covers both single-region and multi-region instruction correspondences in everyday scenes, with the latter highlighting the ambiguity and diversity of affordance grounding in realistic multi-object environments. To construct it at scale, we build A2A-AffordGen, an agent-assisted annotation pipeline that combines language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification. A2A-Bench's supervision further supports diverse downstream applications, with real-time affordance grounding and affordance-conditioned manipulation policies as two representative examples. Experiments show that A2A exposes substantial gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines, while improving task-level localization and providing useful spatial priors for downstream manipulation. All datasets and code will be publicly released to promote open research.
comment: 23 pages
Multi-Agent Next-Best-View Optimization for Risk-Averse Planning IROS 2026
Multi-agent Next-Best-View (NBV) selection for safe path planning in uncertain and unknown environments requires informative, safety-aware, and efficient coordination. Centralized approaches rely on sharing raw sensor data or significant communication overhead, resulting in limited scalability. We propose a distributed, risk-aware multi-agent NBV framework in which each robot maintains a private local 3D Gaussian Splatting map and the team jointly maximizes expected information gain (EIG) restricted to masked zones along planned trajectories. The resulting distributed objective is solved by Consensus ADMM (C-ADMM) over a communication graph, with each robot exchanging only candidate viewpoints, planned trajectory descriptors, and scalar EIG contributions. Collision risk along each trajectory is modeled via Average Value-at-Risk (AV@R) over the local 3DGS map and used both to shape the masking radius and to score planned paths. Experiments in Gibson environments at multiple team sizes show that the distributed formulation approaches the centralized baseline in mapping quality and trajectory safety while reducing communication by orders of magnitude.
comment: 8 pages, 5 figures. Submitted to IROS 2026
Selecting haptic guidance models in teleoperation: guidelines from a comparative user study
Haptic guidance in teleoperation enhances operator performance through force feedback. This paper presents guidelines to select the most appropriate model considering the task, the environment and the operator. We define a unified formulation expressing most common models (spring-damper, potential field, and guiding tube) as variations of a stiffness-damping system with model-specific guiding functions. We conducted a user study comparing the three classical models across six scenarios with varying environmental conditions in a vertical farming task. Results show no universally superior model: spring-damper excels in cluttered environments, potential field in free spaces (but it shows risks near obstacles), and guiding tube offers a balanced compromise. We propose novel objective metrics to evaluate the interaction, and show that guiding force magnitude correlates with comfort and trust scores. These findings provide practical model selection guidelines through environmental characteristics and real-time evaluation metrics.
comment: EUROHAPTICS 2026 - EuroHaptics International Conference, Jul 2026, Sienna, Italy
CoPark: Learning Reactive Parking via Self-Play
Learning a single policy that reaches a goal with high geometric precision while interacting safely with nearby agents poses conflicting objectives. Precision favors commitment to a fixed geometric plan, whereas interaction requires immediate deviation when another agent intrudes, causing policies optimized for one objective to often fail at the other. We study this problem in the context of reactive autonomous parking, where multiple vehicles must reach assigned slots with sub-meter terminal accuracy while remaining responsive to neighboring vehicles throughout the maneuver. We propose CoPark, a multi-agent self-play RL approach built on a residual-policy architecture. A precomputed offline plan provides a fixed action prior, while a residual head learns the reactive corrections. The residual policy learns behaviors under self-play, where data and scripting fall short, while the fixed prior holds the slot-frame geometry that pure policies struggle to reach reliably. The key design is a partner-threat-modulated, channel-asymmetric release of the prior. A continuous threat signal shifts authority of the longitudinal channel to the residual head to enable yielding, while the lateral channel remains anchored to the precomputed reference to preserve sub-meter slot alignment. A closed-loop refinement layer corrects residual terminal error from action-grid discretization. We train our policy on six parking lots and evaluate zero-shot on our new reactive-parking benchmark spanning Dragon Lake Parking (DLP) and DeepScenario Open 3D (DSC3D). CoPark achieves ~70-85% success with only 3-6% collision rate, substantially outperforming classical, imitation-learning, and large-scale RL baselines. Importantly, the results demonstrate emergent interaction behaviors such as reverse-yielding, mid-maneuver yielding, tight-corridor passing, and queuing.
CLAW: Learning Continuous Latent Action World Models via Adversarial Latent Regularization
We introduce CLAW, a fully end-to-end self-supervised framework for learning a world model jointly with continuous latent action representations directly from action-free videos. Our approach leverages adversarial latent regularization and diffusion-based video generation to capture structured and semantically meaningful action representations while modeling rich, predictive environment dynamics, without relying on any action labels or annotations. By simultaneously training the Latent Action Model and world model, CLAW learns to reason about how inferred actions induce environment transitions from visual observations alone. We show that the resulting latent action world model supports both imitation learning from observation and goal-directed planning. In imitation learning, latent actions extracted from raw videos enable behavior cloning. For planning, CLAW generates sequences of latent actions and maps them to executable actions to reach desired goals. Extensive experiments across diverse tasks and embodiments demonstrate that CLAW produces semantically meaningful latent action representations, supports effective action transfer, and enables planning and imitation from observation, outperforming existing methods.
comment: 8 pages, 15 pages of supplementary material
Semantic Constraint Synthesis for Adaptive Trajectory Optimization via Large Language Models CVPR 2026
Trajectory optimization is a critical component for enabling safe and reliable autonomous operations in space exploration. As space missions increase in frequency, complexity, and scope, there is a growing need to rapidly formulate mathematically sound trajectory optimization problems that accurately reflect mission objectives and operational constraints. However, translating mission intent into tractable analytical formulations for trajectory optimization requires substantial domain expertise. This paper presents a framework that leverages large language models (LLMs) to translate natural language descriptions of mission requirements and constraints into executable trajectory optimization code and corresponding mathematical formulations. Experiments in spacecraft rendezvous scenarios demonstrate a high success rate in reconditioning a convex trajectory optimization problem from semantic mission requirements. Ultimately, this work highlights the potential of LLMs to bridge high-level intent and formal optimization models, enabling more flexible and efficient trajectory design of spacecraft.
comment: 7 pages, 4 figures, Presented as a short paper at IEEE CVPR 2026, AI4Space Workshop
Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs
Interpretable autonomous driving planners depend not only on generating explanations, but also on those explanations remaining reliable under real-world sensor degradation. In this paper we present a controlled perturbation study of Vision-Language-Action (VLA) robustness in autonomous driving, evaluating Alpamayo R1 (10B parameters) across 1,996 scenarios under eight sensor perturbations (Gaussian noise at four intensities, two lighting extremes, and two fog levels; ${\sim}18{,}000$ inference trials). We find that reasoning consistency is a high-fidelity indicator of trajectory reliability: when Chain-of-Causation (CoC) explanations change after perturbation, trajectory deviation spikes $5.3{\times}$ (21.8m vs 4.1m), with $r\!=\!0.99$ across attack types and $r_{pb}\!=\!0.53$ per-sample (Cohen's $d\!=\!1.12$). A controlled ablation provides evidence that enabling CoC generation is associated with improved trajectory accuracy (11.8% on average across conditions; $p < 0.0001$) under matched inference settings. Over the tested noise range ($σ\in \{10, 30, 50, 70\}$), degradation is approximately linear ($R^2\!=\!0.957$), while standard input preprocessing defenses provide only marginal relief. Together, these results establish CoC consistency as a quantitative proxy for planning safety and motivate reasoning-based runtime monitoring for safer VLA deployment.
Learning to Adapt Control Barrier Functions Under Epistemic and Aleatoric Uncertainty
Control barrier functions (CBFs) provide a tractable mechanism for enforcing safety constraints in robotic systems, but their practical performance depends strongly on the choice of class-K function parameters. Under input constraints, conservative parameters often preserve feasibility at the cost of slow progress, whereas aggressive parameters can make the CBF-based optimization infeasible or unsafe. This paper proposes Online Adaptive CBF (OA-CBF), a framework for adapting CBF parameters at runtime. We introduce the notion of locally validated CBF parameters, which certify candidate parameters over a finite prediction horizon, and show that safety is preserved when such validation is maintained over successive update intervals. To identify locally validated parameters efficiently, OA-CBF trains a probabilistic ensemble neural network to evaluate queried CBF parameters rather than directly predict a single parameter. A graph-attention encoder represents variable-size obstacle environments, an epistemic uncertainty gate calibrated by conformal prediction rejects unreliable predictions, and a distributionally robust CVaR condition screens aleatoric risk. Among the verified candidates, OA-CBF selects the parameter with the best predicted progress metric and applies it through either an MPC-CBF or CBF-QP safety filter. Simulation studies on dynamic unicycle, planar and three-dimensional quadrotor, kinematic bicycle, and VTOL quadplane benchmarks show that OA-CBF reduces the conservatism of fixed-parameter CBF controllers while maintaining low collision and infeasibility rates.
comment: Extended journal version of the IEEE CDC 2025 paper (available as arXiv:2504.03038v5). Project page: https://www.taekyung.me/oa-cbf
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
Multiagent Systems
What Makes Majority Illusion Easy to Detect?
Majority illusion is an undesirable phenomenon in social networks in which agents incorrectly perceive a minority opinion as dominant. This can severely distort collective behavior and decision-making. We study the fundamental question of detecting whether a social network allows for a majority illusion. Formally, in the $q$-Majority Illusion problem, we ask whether there exists a binary labeling of agents in which at least a $q$-fraction of agents have the majority of neighbors with the minority label. We investigate how various structural properties of the underlying social network influence the tractability of this question, and provide a detailed map of its computational complexity.
Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions
How much should an LLM agent remember, and how should multi-agent systems be connected when trying to reach consensus? We show these two design choices interact in a way that flips the sign of memory's effect on coordination. Across 432 simulation runs of a networked Naming Game on eight fixed 16-agent topologies, we vary memory depth and network structure. Longer memory slows the time to reach steady state in decentralized networks but accelerates it in centralized ones; the same parameter pushes the system in opposite directions depending on topology. Critically, "faster settling" in centralized networks means locking in to a fragmented plateau more quickly, not reaching system-wide consensus, which can be used to generate diverging opinions. We further document a memory-mediated speed-unity trade-off: centralized networks consistently preserve more competing conventions than decentralized networks, but their settling speed depends sharply on memory. At the agent level, within-network analyses show that high-betweenness bridges suffer a brokerage penalty while agents in locally clustered neighborhoods achieve higher coordination success. Finally, in search of analytically tractable generative mechanisms, we find that agents' choices are well captured by Fictitious Play, indicating belief-based rather than reward-based adaptation. The practical implication: memory depth and communication topology should be co-designed, not optimized in isolation.
comment: Submitted to the Journal of Artificial Societies and Social Simulation (JASSS)
From 'What' to 'How' and 'Why': Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data with Remote Family Members
With the growing prevalence of modern ubiquitous computing technologies, multi-modal tracking systems hold promise for providing timely awareness and reassurance to stakeholders such as remote family members (RFMs) of older adults, who play a central role in care coordination. However, combining heterogeneous data streams into high-level, meaningful content - such as retrospective summaries - remains challenging. While recent work has demonstrated the promise of large language models (LLMs) for interpreting multi-modal tracking data, less attention has been given to generating narrative accounts for stakeholders like RFMs, who possess rich personal knowledge of older adults and strong emotional responsibility, yet have limited visibility into their daily lives and limited capacity for caregiving. In this work, we explore how LLMs can be used to generate retrospective summaries from multi-modal tracking data for RFMs of older adults. We leveraged and customized an existing system, Vital Insight, to generate initial summaries on different dates and data availability scenarios as technology probes, and conducted interviews with 11 RFMs to gather feedback. Based on these insights, we redesigned the system into a multi-layer, multi-agent, insight-driven summary approach that builds from objective statistics and descriptions to enriched, context-aware narratives. We then compared the redesigned summaries with the initial versions through a survey with the same 11 RFMs and found significant improvements in satisfaction, perceived helpfulness, trust, and willingness to receive the summaries. We conclude by presenting design implications for AI-generated summaries for RFMs and broader contexts, emphasizing the need to support RFMs' sensemaking shift from simply presenting ''What'' data were collected, to explaining ''How'' is my loved one doing and ''Why''.
On dynamic multi-agent pathfinding methods: review, simulations and modifications
This paper presents a systematic study of pathfinding algorithms in the context of Dynamic Multi-Agent Pathfinding (D-MAPF), a setting that combines dynamic obstacles, partial observability, and inter-agent conflicts. We evaluate six representative algorithms: Dijkstra, D* Lite, Space-Time A*, WHCA*, M*, and a novel method denoted as A** within a unified simulation framework. The proposed A** algorithm introduces a template-based approach that decouples offline geometric path generation from online temporal adaptation. By precomputing multiple diverse candidate paths and dynamically reconnecting to them using space-time planning, A** improves solution quality in environments with frequent changes and limited sensing
D2MDT: Department-aware Multidisciplinary Team Consultation with Deliberation for Efficient Clinical Prediction
Electronic health records (EHRs) are central to clinical prediction, but existing methods either rely on correlation-driven deep models or use single large language models (LLMs), making it difficult to support multidisciplinary clinical reasoning. Recent multi-agent systems (MAS) provide a promising alternative, yet current EHR-grounded MAS methods still suffer from weak evidence differentiation across agents and redundant multi-round interaction. We propose D2MDT, a Department-aware MultiDisciplinary Team Consultation with Deliberation for Efficient clinical prediction. D2MDT first constructs structured EHR evidence and consultation-ready semantic evidence for multi-agent consultation. It then assigns patient-specific department perspectives to doctor agents and retrieves complementary evidence for collaborative consultation. To improve efficiency, D2MDT further introduces residual deliberation, which updates only unresolved consensus rather than replaying the full discussion history. Finally, D2MDT fuses the refined consensus report with structured EHR representations for prediction. Experiments on mortality prediction show that D2MDT improves both predictive performance and consultation efficiency. We release the code online to ease the reproducibility of this paper.
comment: Preprint. 17 pages
A formal definition and meta-model for a machine theory of mind
This paper proposes, for the first time, a rigorous formal definition of the concept of Machine Theory of Mind, based on principles supported by evidence from cognitive psychology, neuroscience and artificial intelligence, and uses the above as a lens to examine state-of-the-art and current efforts in the field, driving a potential agenda for further research there able to "crack" the problem. It also advances a general holistic meta-model for Machine Theory of Mind, and examines the state of the art when it comes to empirically benchmarking such models.
comment: 48 pages, 2 figures
Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study
LLM-agent budget overruns are a documented production failure class: a single retry loop can spend thousands of dollars before an operator notices, and the in-process integrity properties that would prevent it (no aliasing, no double-spend, no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023-2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight-cluster failure taxonomy (inter-rater Cohen's kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token-budgets, an 1,180-line Rust crate (no unsafe) that operationalizes affine ownership so that cloning, double-spending, or using a budget after delegating it are compile errors rather than runtime hazards an operator must remember to avoid. The dollar cap is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non-bypassability under operator error in multi-agent delegation: the delegation-fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over-reservation is 4-6x (2.11x adaptive). Binary-level cap-soundness on the running binary is left open.
comment: 26 pages. Artifact (catalog CSV, Rust crate, formal proofs): https://github.com/sajjadanwar0/token-budgets
FORGE: Multi-Agent Graduated Exploitation and Detection Engineering
Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.
comment: 18 pages, 4 figures, 3 tables. Accepted at the AgentCy Workshop at the 21st International Conference on Availability, Reliability and Security (ARES 2026). Keywords: Vulnerability assessment, Multi-agent systems, Exploit generation, Detection engineering, Risk prioritization
MeDxAgent: Multi-Agent Consultation for Interactive Medical Diagnosis
Large language models (LLMs) are increasingly used for health-related decision support. Yet most evaluations treat diagnosis as a single-shot task with complete information provided upfront, often as a multiple-choice selection. This diverges from clinical practice, where diagnosis is interactive and open-ended, involving sequential hypothesis refinement through targeted questioning. We address this gap. We build MeDxBench, a large-scale benchmark of 4,421 clinical cases across 20 specialties. We further propose MeDxAgent, a multi-agent consultation system for interactive diagnosis, and systematically study its prompt-, flow- and agent-level design choices. MeDxAgent achieves a 10.3% accuracy gain over the baseline on MeDxBench, closing 52.3% of the gap to a full-information oracle. We find that specific design choices: collecting demographics first, passing summarized dialogue for diagnosis, and feeding candidate diagnoses for targeted questioning, improve accuracy, mirroring how physicians reason, though their effect emerges fully only in combination. Code and dataset will be released upon publication.
comment: 28 pages, 6 figures
Validation-Gated Multi-Agent Governance for Online Adaptation of Thermal-Hydraulic Surrogate Models under Operating-Regime Shift
Artificial-intelligence surrogates can support second-by-second thermal-hydraulic forecasting, but models selected and frozen offline may become condition-locked once deployed outside their pretraining envelope. This study develops a guarded continual-adaptation framework for experimental thermal-hydraulic loop data in which role-separated agents - Monitor, Diagnosis, Adaptation, Safety-Auditor, and Orchestrator - diagnose error signatures, prioritize candidate model families, and review promotions, while deterministic champion-challenger gates and background shadow learning retain final authority over model replacement. Seven surrogate families were screened by blocked three-fold cross-validation, and a temporal Fourier neural operator was selected as the initial champion for 60-s-history-to-10-s-trajectory forecasting on two held-out transients, with three seeds per adaptive mode. Static deployment gave a channel-averaged MAE of 7.06 and a 56.8% warning-exceedance ratio; rule-based adaptation reduced MAE to 6.54, whereas shadow refresh alone remained close to Static. The MA-Full mode, in which the role-separated multi-agent council reviews every evaluated stream step, achieved the lowest mean error, 5.72, and 35.8% exceedance, corresponding to a 19.0% improvement over Static. Paired bootstrap intervals against Static excluded zero, although intervals among adaptive modes overlapped and the six paired units limit broad statistical claims. Validated promotions from the neural operator to Transformer and graph neural network indicate that logged, gate-controlled adaptation can support auditable surrogate evolution while deterministic gates retain deployment authority.
Solipsistic Superintelligence is Unlikely to be Cooperative
AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverge from the deployment context. We refer to this as the self-undermining property of unilateral optimization. Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence. We call for a non-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve. This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.
comment: 24 pages, 1 figure, Accepted at Proceedings of the 43rd International Conference on Machine Learning, 2026
SPOQ: Specialist Orchestrated Queuing for Multi-Agent Software Engineering
Multi-agent AI systems show promise for automating software engineering tasks, yet existing approaches suffer from coordination overhead, quality control gaps, and limited human oversight. We introduce SPOQ (Specialist Orchestrated Queuing), a methodology combining three innovations: (1) wave-based topological dispatch that computes parallel execution waves from task dependency graphs; (2) dual validation gates applying quality metrics before execution (planning validation) and after (code validation) to reduce rework cycles; and (3) Human-as-an-Agent (HaaA) integration, where a human specialist participates in decomposition and can be consulted during execution. SPOQ uses a three-tier agent hierarchy (Opus workers, Sonnet reviewers, Haiku investigators) to optimize cost-quality tradeoffs. We evaluate SPOQ through four experiments. Experiment 1: wave dispatch approaches the critical-path lower bound (ratio 1.03--1.11, speedup up to 14.3x); on a 2-slot local backend it delivers a stable 1.4x speedup. Experiment 2: SPOQ improves planning coverage from 93.0 to 99.75, eliminates cyclic plans, and lifts parallelism from 31.0 to 75.25. Experiment 3: dual validation reduces defects from 0.34 to 0.20 per task and lifts test pass rate from 91.25% to 99.75%. Experiment 4: human review reduces residual defects from 0.47 to 0.03 per task. Results are replicated on a locally hosted open-weights model (Qwen3.6-35B-A3B), verifying gains are attributable to orchestration rather than any specific model. A longitudinal study across 17 repositories, 8,589 commits, 1,822 tasks, and 13,866 tests (99.87% pass rate) provides ecological validation.
comment: 55 pages, 12 tables, 6 figures; includes longitudinal deployment study and open-weights replication
ModuLoop : Low-Level Code Generation using Modular Synthesizer and Closed-Loop Debugger for Robotic Control
Large Language Models (LLMs) have demonstrated impressive performance across various domains, including code generation and problem solving. However, their application in robotic control, particularly in low-level tasks that require precise manipulation, real-time feedback, and environment-dependent execution, remains limited. To address this challenge, we propose the Closed-Loop Modular Code Synthesizer framework. This framework leverages a pre-trained LLM without any task-specific fine-tuning to perform modular code planning and generation, and iteratively executes the generated code while inserting debugging probes to observe its behavior. This closed-loop structure facilitates systematic debugging and refinement, ultimately producing executable control programs. We apply the proposed framework to the calibration of an RGB-D camera and a robotic arm, validating its effectiveness in real-world settings. Furthermore, through a subsequent pick-and-place task, we demonstrate not only the accuracy of the calibration but also the potential extensibility of the framework. Across both tasks, the framework achieved high execution accuracy and autonomy, illustrating the practicality and scalability of LLM-based robotic control using our framework.
comment: IEEE Robotics and Automation Letters (2025)
Capability Advertisement as a Market for Lemons: A Trust Layer for Heterogeneous Agent Networks
Large language model (LLM) agents have begun to delegate work to one another. Protocols such as the Model Context Protocol (MCP) and the Agent2Agent protocol (A2A) let an agent publish what it can do and let others call it, and public registries of such agents are already appearing. These protocols assume an advertised capability is a static, truthful fact. A real agent is none of these things: its competence is probabilistic, varies with input, drifts when the underlying model is updated, and, because the agent is itself a language model, it can describe itself with complete confidence and be wrong. A caller therefore sees what an agent claims to do, not what it can do, with no principled way to tell a reliable provider from a fluent impostor. We argue these difficulties share one cause: the market for lemons. When quality is hidden and claims are cheap, good and bad providers become indistinguishable, honest reliability goes unrewarded, and the market decays toward its worst participants. Economics offers three remedies, signaling, screening, and reputation, and none are present in today's agent protocols. We make four contributions: (1) a failure taxonomy that names confident-wrong as a non-adversarial, correlated subclass of Byzantine faults that classical fault-tolerance mismodels; (2) a market-for-lemons model showing that faith-based protocols admit only a low-trust equilibrium; (3) the Trust Layer, a thin, protocol-agnostic narrow waist above MCP and A2A that adds probabilistic capability descriptors, screening, and reputation, and admits a separating equilibrium when the cost of sustaining an overclaim exceeds the gain from it; and (4) a reliability-composition bound for delegation chains with an end-to-end placement argument. The design needs no model retraining and degrades gracefully when its trust anchors are absent or corrupt.
AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification
Structured financial audit verification is difficult for language-model agents because correctness depends on structured evidence rather than text alone. A model must link reported facts to taxonomy concepts, traverse calculation or dimensional relations, and recompute expected values before applying an audit rule. We propose AuditFlow, a graph-grounded multi-agent framework that separates adaptive search from deterministic verification. AuditFlow builds a symbolic environment from a static US-GAAP taxonomy graph and a dynamic XBRL filing graph, and exposes it through typed tools for fact retrieval, taxonomy traversal, numerical checking, and rule evaluation. Two junior auditors inspect each case from regulatory and evidentiary views, while a senior auditor resolves disagreements and can request further investigation. The final reports are fused through evidential aggregation to produce an audit verdict, expected value, evidence trail, and trustworthiness score. On a FinAuditing-derived FinMR sample, AuditFlow reaches 82.09% joint audit accuracy under GPT-5.5, outperforming the strongest baseline by 14.93 points. Removing deterministic checks drops accuracy to 17.91%, showing that the symbolic environment performs the verification step that the model cannot reliably replace.
Assistax: A Multi-Agent Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics
The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run and easy to understand. While games such as Go and Atari have led to many breakthroughs, they often do not directly translate to real-world embodied applications. In recognising the need to diversify RL benchmarks and addressing complexities that arise in embodied interaction scenarios, we introduce Assistax: an open-source benchmark designed to address challenges arising in assistive robotics tasks. Assistax uses JAX's hardware acceleration for significant speed-ups for learning in physics-based simulations. In terms of open-loop wall-clock time, Assistax runs up to $370\times$ faster when vectorising training runs compared to CPU-based alternatives. Assistax conceptualises the interaction between an assistive robot and an active human patient using multi-agent RL to train a population of diverse partner agents against which an embodied robotic agent's zero-shot coordination capabilities can be tested. Extensive evaluation and hyperparameter tuning for popular continuous control RL and MARL algorithms provide reliable baselines and establish Assistax as a practical benchmark for advancing RL research for assistive robotics. The code is available at: https://github.com/assistive-autonomy/assistax.
comment: Accepted at the Reinforcement Learning Conference 2026
Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within model weights, agent skills -- composable packages of instructions, code, and resources that agents load on demand -- enable dynamic capability extension without retraining. It is formalized in a paradigm of progressive disclosure, portable skill definitions, and integration with the Model Context Protocol (MCP). This survey provides a comprehensive treatment of the agent skills landscape, as it has rapidly evolved during the last few months. We organize the field along four axes: (i) architectural foundations, examining the SKILL$.$md specification, progressive context loading, and the complementary roles of skills and MCP; (ii) skill acquisition, covering reinforcement learning with skill libraries, autonomous skill discovery (SEAgent), and compositional skill synthesis; (iii) deployment at scale, including the computer-use agent (CUA) stack, GUI grounding advances, and benchmark progress on OSWorld and SWE-bench; and (iv) security, where recent empirical analyses reveal that 26.1% of community-contributed skills contain vulnerabilities, motivating our proposed Skill Trust and Lifecycle Governance Framework -- a four-tier, gate-based permission model that maps skill provenance to graduated deployment capabilities. We identify seven open challenges -- from cross-platform skill portability to capability-based permission models -- and propose a research agenda for realizing trustworthy, self-improving skill ecosystems. Unlike prior surveys that broadly cover LLM agents or tool use, this work focuses specifically on the emerging skill abstraction layer and its implications for the next generation of agentic systems. Project repo: https://github.com/scienceaix/agentskills
comment: Accepted by Agent Skills '26 Workshop at ACM Conference on AI and Agentic Systems 2026
QoEReasoner: An Agentic Reasoning Framework for Automated and Explainable QoE Diagnosis in RANs
Diagnosing Quality-of-Experience (QoE) degradations in operational Radio Access Networks (RANs) is a critical but notoriously complex task, traditionally requiring labor-intensive expert analysis over high-dimensional, cross-layer telemetry. While Large Language Models (LLMs) offer unprecedented reasoning capabilities, they are fundamentally unsuited for raw RANs troubleshooting: they fail at numeric time-series analysis, hallucinate protocol-violating causal links, and lack the stateful rigor required for multi-step fault localization. To bridge this gap, we present QoEReasoner, an end-to-end, LLM-driven agentic system designed for automated and explainable QoE diagnosis. QoEReasoner tames the inherent unpredictability of LLMs by grounding their reasoning in the physical realities of the network. It employs deterministic tools to reliably translate raw numeric KPIs into structured evidence, enforces protocol-consistent fault propagation through a domain-specific Knowledge Base, and leverages a Historical Bank of expert-validated cases to guide hypothesis generation. A stateful central planner orchestrates this closed-loop process across anomaly detection, causal tracing, and root-cause localization. Evaluations on real-world operational RANs datasets demonstrate that QoEReasoner outperforms strong baselines by 18\%-40\% in accuracy across multiple diagnostic tasks. Furthermore, it reduces diagnostic time from approximately 30 minutes of manual expert analysis to just 3 minutes per session, delivering highly interpretable, expert-grade reports while remaining robust across diverse LLM backbones.
Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation
Modeling coordination among generative agents in complex multi-round decision-making presents a core challenge for AI and operations management. Although behavioral experiments have revealed cognitive biases behind supply chain inefficiencies, traditional methods face scalability and control limitations. We introduce a scalable experimental paradigm using Large Language Models (LLMs) to simulate multi-stage supply chain dynamics. Grounded in a Hierarchical Reasoning Framework, this study specifically analyzes the impact of cognitive heterogeneity on agent interactions. Unlike prior homogeneous settings, we employ DeepSeek and GPT agents to systematically vary reasoning sophistication across supply chain tiers. Through rigorously replicated and statistically validated simulations, we investigate how this cognitive diversity influences collective outcomes. Results indicate that agents exhibit myopic and self-interested behaviors that exacerbate systemic inefficiencies. However, we demonstrate that information sharing effectively mitigates these adverse effects. Our findings extend traditional behavioral methods and offer new insights into the dynamics of AI-enabled organizations. This work underscores both the potential and limitations of LLM-based agents as proxies for human decision-making in complex operational environments.
DeMuon: A Decentralized Muon for Matrix Optimization over Graphs
In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations-a technique inherited from its centralized predecessor, Muon-and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate a clear margin of improvement of DeMuon over other popular decentralized algorithms across different network topologies.
comment: Add an accelerated variant of the proposed method. New proofs of proposed methods
Measuring Weak-to-Strong Legibility of Reasoning Models ICML 2026
Reasoning language models (RLMs) and the intermediate chains of thought they emit play an increasingly central role in multi-agent setups such as inter-model monitoring or distillation into smaller models. When agents at different capability tiers must cooperate, strong models need to produce traces digestible by weaker ones. We refer to this goal as "weak-to-strong legibility". Trustworthiness of large models depends in part on this legibility property. For safety oversight in particular, adoption of weak monitors may become a standard for reliability scaffolds on a healthy budget. Legibility requires that the shape of these decision-making traces takes some form accessible to weaker monitors. Existing efficiency-based metrics for legibility fail to capture "thoroughness", instead focusing on conciseness.
comment: Accepted to Trustworthy AI4GOOD Workshop @ ICML 2026
RocketSmith: An Agentic System for High-Powered Rocket Design and Manufacturing
This work presents RocketSmith, an agentic system capable of the design, manufacturing, and optimization processes in high powered rocket development. The system enables the intelligent automation of software tools as to not only validate factors such as flight stability but also generate the parametric design components for the rocket assembly. A collection of subagents and skills enable optimization workflows of flight parameters via iteration in both zero-shot and human-in-the-loop workflows. With this system, four distinct high power rockets with various motor and assembly configurations were developed utilizing the unique design capabilities of additive manufacturing. These assembly components were fabricated using various FDM printers, manually evaluated for flight readiness, and flight tested at a launch event. From these tests, all rockets achieved a stable launched and two of the four rockets were successfully recovered in reflyable condition. Within the collected flight data, an 84% accuracy was achieved when comparing measured apogee to that calculated in flight simulations.
Systems and Control (EESS)
State Observers for Linear Systems with Prescribed Residual Bounds
This paper presents a state observer design for continuous linear time-invariant (LTI) systems subject to unknown bounded disturbances, that enforces a prescribed bound on the observer residual. The proposed observer augments a continuous-time Luenberger observer with state resets, triggered when the norm of the residual equals a pre-specified bound. The reset map guarantees contraction of the residual at jump instants while preserving the uniform boundedness properties of a standard Luenberger observer. The paper also establishes forward invariance of the residual envelope and non-expansiveness of the estimation error in a Lyapunov metric. Simulation results confirm the analysis. Under bounded disturbances, the residual stays within the prescribed bound. A standard Luenberger observer with the same gains violates this bound.
Adaptive arrival cost update for improving Moving Horizon Estimation performance
Moving horizon estimation is an efficient technique to estimate states and parameters of constrained dynamical systems. It relies on the solution of a finite horizon optimization problem to compute the estimates, providing a natural framework to handle bounds and constraints on estimates, noises and parameters. However, the approximation of the arrival cost and its updating mechanism are an active research topic. The arrival cost is very important because it provides a mean to incorporate information from previous measurements to the current estimates and it is difficult to estimate its true value. In this work, we exploit the features of adaptive estimation methods to update the parameters of the arrival cost. We show that, having a better approximation of the arrival cost, the size of the optimization problem can be significantly reduced guaranteeing the stability and convergence of the estimates. These properties are illustrated through simulation studies.
AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation
Indoor UAV navigation requires efficient exploration, scene understanding, and reliable trajectory execution under limited field-of-view observations. Existing vision-based navigation frameworks typically rely on single-view observations, limiting their ability to reason about occlusions, target visibility, and global scene structure. In this work, we propose AgenticDiffusion, a multi-view UAV navigation framework that coordinates language-guided reasoning, open-vocabulary target grounding, vision-based diffusion planning, and NMPC within a unified aerial navigation pipeline. Given a natural language instruction and synchronized first-person-view (FPV) and top-view observations, the framework determines the most informative viewpoint for navigation and generates a mission plan prior to trajectory execution. The targets are localized using an open-vocabulary grounding model, after which viewpoint-specific diffusion planners generate navigation trajectories for UAV execution. Using complementary viewpoints, the proposed framework reduces repeated target exploration and improves navigation efficiency in cluttered indoor environments. The framework was validated in four real-world UAV navigation scenarios involving adaptive viewpoint selection, multi-stage mission execution, long-horizon navigation, and safe landing-site selection. The experimental results demonstrated an overall mission success rate of 80% in 40 real-world trials, while the diffusion planners achieved a trajectory generation success rate of 100%.
Multi-Robot Bearing-only Pose Estimation via Angle Rigidity
This letter proposes a novel distributed bearing-based pose estimator for time-varying multi-robot systems. The method uses angles computed from body-frame bearings to estimate the robots' positions in $\mathbb{R}^3$ without knowledge of their orientations. The orientations in $\mathrm{SO}(3)$ are recovered from the estimated positions, the bearings, and the bearing derivatives. The proposed observer only requires the (directed) sensing topology to be \textit{angle-rigid}, a weaker condition than the commonly used ones like bearing rigidity. Local uniform exponential stability of the proposed observer is established under the assumption of persistently exciting motions for a subset of robots. Simulations are presented and discussed to evaluate the scheme's effectiveness and practicality.
A Dynamic Capacity Allocation Model for DERs under Non-Firm Connection Agreements
The growing penetration of distributed energy resources (DERs) intensifies congestion in distribution networks by introducing bidirectional power flows and increasing competition for limited network capacity, underscoring the need for effective and efficient congestion management, including flexible grid-access schemes. This paper proposes a bilevel optimization model for the dynamic allocation of connection capacity to DERs under non-firm connection agreements, aligning the objectives of distribution system operator (DSO) and DER owners. The upper-level problem, representing the DSO, determines the allocated connection capacity for all DERs, defined as maximum time-varying power limits, subject to distribution system constraints and the last-in-first-out (LIFO) allocation rule. The lower-level problem, representing DER owners, maximizes the profit of each DER within the allocated power limits. The proposed model is tested on a modified CIGRE medium-voltage (MV) network, demonstrating a balanced trade-off between grid utilization and economic efficiency. Furthermore, the model enhances DER integration, enforces transparent allocation rules, reduces variability in allocation patterns, and achieves up to an 80% reduction in total curtailment costs compared with benchmark methods.
NeuroSymbolic Robustness Analysis for Discrete Systems with Respect to Transition Deviations
Supervisory control of discrete-event systems provides formal guarantees of correctness with respect to a plant model and specification. However, these guarantees heavily rely on the plant model, which could deviate from nominal behavior due to modeling errors or faults. Recent notions of discrete robustness model deviations as a set of additional transitions that are added to the plant. The discrete robustness is defined as all sets of extra transitions for which the supervised system still guarantees a desired specification. However, this notion suffers from scalability due to the large solution space and conservatism since most deviations are infeasible in practice. This paper proposes to address these two issues using a neurosymbolic computing framework for discrete robustness analysis of safety properties. First, a neural reasoning layer based on Large Language Models infers a set of feasible deviation transitions from system models, specifications, and domain knowledge. Next, a symbolic layer computes the discrete robustness guarantees over the inferred deviation set. We evaluate our framework on three case studies, demonstrating that our method identifies a smaller set of feasible deviations while preserving robustness guarantees comparable to those of full transition-based analysis.
APX-Hardness of Computing Lipschitz Constants for Multi-Parametric Quadratic Programs
Computing the Lipschitz constant of the solution map of a multi-parametric quadratic program is important for the analysis of optimization-based control. This problem is governed by three factors: the parameter dimension, the number of decision variables, and the number of constraints. While empirical evidence has long suggested exponential complexity, a rigorous complexity-theoretic proof has been lacking. In this paper, we fill this gap by proving that this problem is not only NP-hard but also APX-hard. Furthermore, we reveal that: (a) the problem becomes polynomial-time solvable when the number of constraints or decision variables is fixed; and (b) both NP-hardness and APX-hardness persist even in the scalar parameter case. These results confirm that the complexity stems from the number of constraints and variables, rather than the parameter dimension. Numerical experiments further validate these theoretical findings.
An Integrated Techno-Economic Framework for Optimal Microgrid Design: An Australian Case Study
Reliable and affordable electricity supply remains a challenge for remote and regional communities, motivating the deployment of renewable-based microgrids supported by flexible storage and advanced planning methods. This paper proposes an integrated techno-economic framework for optimal microgrid design and robustness assessment, and applies it to a 1000-household residential community in Rockhampton, Queensland (Australia). The framework links time-series simulation, dispatch-based operation, and lifecycle costing to evaluate hybrid configurations comprising photovoltaic and wind generation, battery storage, diesel backup, grid exchange, and an optional hydrogen subsystem (electrolyzer--hydrogen storage--fuel cell). Key indicators include net present cost (NPC), cost of energy (COE), renewable penetration, energy purchased/sold, and emissions-related outcomes. To avoid conclusions that depend on a single set of assumptions, the study performs systematic sensitivity analysis across financial, technical and policy drivers: discount rate, technology capital costs, fuel price, load uncertainty, renewable resource variability, carbon pricing/emissions cost, and grid outage duration, supplemented by a no-hydrogen attribution case. The results demonstrate that several sensitivity dimensions induce nonlinear shifts in the optimal design, including breakpoints where capital-intensive renewable--storage expansion becomes economically preferable. The proposed framework enables transparent comparison of hydrogen-enabled and battery-centric solutions and provides planning guidance for resilient, low-emission community microgrids under Australian operating conditions.
CADET: A Modular Platform for Evaluating Distributed Cooperative Autonomy in Connected Autonomous Vehicles
Deep learning models are increasingly central to autonomous vehicle (AV) pipelines, yet their integration has traditionally followed a monolithic design where perception, planning, and control execute on a single onboard computer. This design overlooks the emerging paradigm of cooperative autonomy, where vehicles interact with roadside units (RSUs), edge servers, and cloud-hosted intelligence through vehicle-to-everything (V2X) connectivity. Cooperative perception and control improve safety and efficiency, but also introduce systems-level challenges: network latency, compute heterogeneity, and multi-tenant contention, all critically affect real-time decision-making. These challenges are further amplified by the increasing reliance on large foundation models, whose scale necessitates cloud deployment. We present CADET (Cooperative Autonomy through Distributed Experimentation Toolkit), a modular platform for systematic and reproducible evaluation of distributed cooperative autonomy systems under realistic deployment conditions. CADET decouples the AV stack into composable modules that can be flexibly deployed across vehicles, infrastructure, and edge/cloud tiers. The framework integrates state-of-the-art models, incorporates trace-driven network and workload emulation, and provides synchronized model-, system-, and task-level instrumentation. Through V2V and V2I experiments, we show that distributed deployment choices fundamentally shape safety, with V2V intent packets outperforming cloud-based perception and RSU-assisted perception sustaining safety until overloaded by concurrent requests. Although designed for AV pipelines, CADET also supports dataset-driven experimentation, enabling systems and ML researchers to benchmark distributed inference workloads independently of full vehicle simulation. CADET is open source, with code and demo available at https://nesl.github.io/cadet-web.
When are supercapacitors practically feasible in electric vehicles?
While the hybrid energy storage system (HESS) can theoretically mitigate battery degradation in electric vehicles, its practical implementation remains highly limited. To delineate the specific scenarios and application boundaries where supercapacitors remain feasible, this study proposes a multi-dimensional techno-economic feasibility evaluation framework. First, a cross-vehicle sizing method based on dynamic programming is established to quantify physical mass-volume packaging constraints and identify feasible supercapacitor candidates across different vehicle types. Building upon the optimal sizing parameters derived from the battery aging Pareto front, an expert-guided deep reinforcement learning energy management strategy is integrated to yield near-optimal online performance, ensuring a fair life-cycle economic assessment. Finally, a comprehensive feasibility matrix is constructed to systematically evaluate mass, volume, battery lifespan, additional supercapacitor costs, total cost of ownership, future energy storage prices, and the influence of emerging solid-state batteries. Results reveal that city buses remain the most promising vehicle type for HESS due to minimal additional costs and sufficient packaging space. Current mass-volume penalties and limited economic benefits hinder HESS application in passenger vehicles and heavy-duty trucks, respectively. This situation may only improve if supercapacitor prices drop significantly in the future. Beyond vehicle types, the HESS feasibility is governed by load-frequency characteristics. Furthermore, looking toward the 2030+ solid-state battery era, we highlight that integrating increasingly affordable supercapacitors can provide substantial asset protection leverage.
comment: 15 pages, 14 figures, about 6900 words
Admittance Sensitivity-Informed Modular GP for Scalable Topology-Adaptive Power-Flow Learning
Data-driven approaches for learning power flow models suffer from weak generalization across varying network topologies and limited computational scalability. Existing methods typically rely on training over a large set of grid topologies, which becomes impractical for large networks. This paper proposes a scalable and computationally efficient framework for topology-adaptive learning of power flow solutions. We propose a modular architecture consisting of bus-level Gaussian Process (GP) models, where each GP collects local features based on bus-level \textit{egonet} definition. The localized bus-level feature includes first-order power and admittance sensitivities, nodal injections and node degree. In addition to the modular architecture, we propose using Random Fourier Features (RFF) for feature reduction, which further enhances the computational scalability. We evaluate the effectiveness of the proposed method by simulations across multiple benchmark networks under N-1, N-2, and N-3 contingencies. Results for the PEGASE 1354 bus system under N-3 contingencies demonstrate high predictive quality, with an $R^2$ score of 0.983 and a voltage-magnitude RMSE of 0.0023 p.u. The framework maintains recall rates exceeding 98\% for detecting voltage limit violations across all test cases. Furthermore, the approach exhibits scalability, completing training and testing for the PEGASE 1354 system in 116.47 seconds while outperforming existing benchmarks in zero-shot generalization without requiring additional training samples.
From Well-Posed Inversion to Learning Design: Physics- Informed Neural Estimation for Autonomic Regulation
Learning-based and physics-informed methods are increasingly used for inverse estimation in controlled nonlinear dynamical systems. However, in many such approaches, the theoretic requirements that make unknown-input reconstruction meaningful, namely well-posedness in the sense of Hadamard, are often disregarded or weakly addressed through generic regularization terms with no explicit guarantees. In this work, we adopt a complementary viewpoint in which these control-theoretic and structural conditions inform the estimator design and constrain its training. We thus develop a physics-informed input-state neural estimator for joint unknown-input and state estimation in nonlinear controlled systems with partial measurements. In the present work, this general framework is instantiated on a model of autonomic cardiac regulation, provides a concrete study case. The estimator is formulated as an inverse neural map conditioned on time and measured outputs, and is trained under data fidelity and dynamical consistency constraints. To ensure it complies with the same structural requirements imposed in robust estimation, we derive left-invertibility conditions by differential-algebraic elimination and embed the resulting constraints directly into the training objective. We further analyze a priori the stability of the inverse mapping to output perturbations and derive a conservative Lipschitz bound that guides the tuning of cost functional hyper-parameters. The framework is evaluated on simulated data, where ground truth data is available, and on two distinct datasets of real cardiovascular recordings. The results show that incorporating control-theoretic solvability constraints into physics-informed learning improves the reliability of inverse inference beyond forward consistency alone.
comment: 16 pages
Optimal Finite-Horizon LQR Control for Traffic Flow via Variable Speed Limits
This article presents a finite-horizon linear quadratic regulator for the control of the first-order Lighthill-Whitham-Richards traffic model with a triangular fundamental diagram. The in-domain control action is realized through variable speed limits implemented as a source term in the governing hyperbolic partial differential equation. Unlike prior studies on infinite-horizon formulations, this article develops a finite-horizon LQR framework, deriving a space and time varying state feedback function for hyperbolic PDEs. The solution to the finite time optimal control problem relies on the solution of another PDE, called the Riccati PDE. The resulting nonlinear Riccati PDE is solved analytically via the parametric method of characteristics. The Riccati PDE solution is a function of both time and space, as well as the traffic regime. A sensitivity analysis demonstrates the effects of the LQR parameters for both the infinite and finite time horizon problem in different traffic situations, while siulations validate the finite-horizon LQR's ability to guarentee finite-time convergence. Comapred to the infinite-horizon LQR, the proposed approach achieves significantly improved control performance across various scenarios, making it particularly suitable for time-sensitive traffic management applications.
comment: 10 pages, 26 figures, submitted to IEEE Transactions on Control Systems Technology
Enhancing Offshore Wind Simulations: Interpolated Switching via DLL Black-Boxes
The modern power system, increasingly composed of Inverter-Based Resources (IBR) from multiple manufacturers, requires new study and design techniques that balance accuracy with the need to protect the Intellectual Property (IP) of various stakeholders. One possible method to support detailed electromagnetic transient (EMT) simulations is to convert the original equipment manufacturers (OEM) models into shareable black-box versions using dynamic link libraries (DLLs). This technique prevents IP violations while potentially maintaining simulation accuracy by embedding the original components within the shareable DLL. Thereby, this work aims explicitly to enhance simulation fidelity by translating full-switching models of offshore wind turbines (OWTs). In this context, the paper offers valuable recommendations, including how to convert interpolation-based elements, preserve simulation speed, recognize limitations, and outline future improvements
comment: In Review at IET Renewable Power Generation
Semidefinite Programming Certificates for Synchronization of Kuramoto Oscillators on Arcs
A class of Kuramoto models with a general coupling function that can be expressed in terms of a finite number of harmonics, each comprising sinusoidal terms, is studied. We propose a novel approach for certifying local phase synchronization in this class for all initial conditions lying on an arc. The trace parametrization property and Gram matrix representation of a trigonometric polynomial are utilized along with Putinar's Positivstellensatz to obtain semidefinite programming certificates for the stability of the phase-difference system, which in turn implies synchronization of the original system. The results can be extended to any system of coupled oscillators where the forward-invariance on arcs can be established.
comment: A version of this work has been accepted for publication in Chaos and Complex Systems: Proceedings of the 6th International Interdisciplinary Chaos Symposium
Systematic Gray-Box Identification Methodology for Voltage Source Converters
This paper introduces a systematic gray-box identification framework for voltage-source converter models based solely on terminal time-series data. The proposed approach combines a physically informed white-box standard model with iterative time-domain calibration to estimate controller parameters that mimic the behavior of the black-box model in electromagnetic transient (EMT) simulations. Unlike conventional frequency-domain identification methods, the framework leverages time-domain data more effectively to better constrain the surrogate model across a broader operating range and capture reference-signal dynamics. To evaluate the accuracy of the identified model, the paper presents additional frequency-domain validation metrics based on Nyquist analysis and singular value decomposition, allowing for both quantitative assessment of model divergence and qualitative classification of mismatch types. The methodology is tested on cases with increasing structural uncertainty, from exact parametric recovery to an actual detailed EMT black-box model. Results demonstrate that the proposed framework can accurately recover parameters when the internal structure is known, adjust for moderate structural mismatch with extra degrees of freedom, and offer a reliability measure for small-signal stability analysis of converter models protected by intellectual property
comment: Submitted to IEEE Transactions on Power Delivery
Estimation of Equivalent SCR for Offshore Wind
The integration of offshore wind power plants (OW-PPs) into weak grids can pose stability challenges due to the interaction between inverter-based resources (IBRs), Flexible AC Transmission Systems (FACTS) and the grid. In this context, long HVAC transmission systems, relatively common for OWPPs, can exacerbate the stability challenges. Therefore, this paper introduces a novel methodology for estimating the equivalent short-circuit ratio (ESCR) at the offshore point of connection (PoC), combining analytical two-port network (TPN) modeling with electromagnetic transient (EMT) simulations. The approach derives the Thevenin equivalent impedance for passive and active components, enabling accurate ESCR computations without complex derivations. Limitations of traditional SCR metrics are addressed by incorporating the dynamics of the converters, such as static synchronous compensators (STATCOMs), into a hybrid EMT-TPN method for synthesizing equivalent impedances. The process is then verified on the CIGRE OWPP benchmark and is found to capture ESCR variations with cable lengths, shunt reactors, and grid strength. Additionally, the results emphasize the correlation between the ESCR and voltage stability, highlighting the role of STATCOMs in supporting voltage stability in weak grids. This modular framework aids in OWPP design and stability analysis for converter-dominated systems.
comment: Accepted at 24th Power Systems Computation Conference (2026 Cyprus) and Electric Power System Research
Recursive Learning of Feedforward and Compliance Compensation Parameters for Precision Motion Systems
To meet the stringent requirements of future motion systems exhibiting time-varying and/or position-dependent behavior, online data must be leveraged to improve control performance. This paper presents a recursive algorithm for simultaneous learning of feedforward and compliance compensation parameters. A multivariate regression formulation is proposed that jointly estimates friction, mass, jerk, and compliance compensation parameters while mitigating parameter coupling. Experimental results on a high-tech semiconductor metrology and inspection system demonstrate an order-of-magnitude improvement in servo performance.
Unstable Poles Arising in AC Power Grid Subsystem Representations
Recent small-signal stability studies of AC grids have shifted towards analysing power systems as interconnections of subsystems and leveraging their input-output properties to derive scalable stability certificates. Two subsystem representations appear frequently in the literature: the PQ model, coupling powers to phase angle and voltage magnitude, and the IV model, coupling currents to voltages. In this paper, we derive both models without simplifying the bus or line dynamics and show that a loop transformation relates the two. One of the main results in the paper is to then show analytically that each representation may exhibit unstable poles depending primarily on the operating point (IV model) or the presence of high-frequency passive dynamics (PQ model). In particular, such unstable poles in the subsystems can occur even when the aggregate interconnection is stable and well-behaved. These effects are validated numerically, including a case study using the full-order dynamics of a synchronous generator with an exciter and transformer. Our results highlight that care must be taken when choosing a subsystem representation, as neglecting high-frequency dynamics or device operating points may obscure unstable poles that must be stabilised by the network interconnection and must be accounted for in system identification.
comment: Submitted to IEEE Conference on Decision and Control (CDC) 2026
Enhancing Collective Self-Consumption through Water Storage Heater Flexibility
While Renewable Energy Communities (RECs) and Collective Self-Consumption (CSC) schemes have emerged as promising tools to accelerate renewable energy adoption and support the net-zero transition, their full potential can only be realised when complemented by demand-side flexibility that aligns consumption with renewable generation. Water storage heaters can function as distributed thermal storage, absorbing excess renewable energy at the community level. This work quantifies both the benefits of water storage heaters flexibility for energy consumers in a CSC community in France (such as energy bill reduction, increase of self-consumption), and the challenges related to the implementation and user acceptance. At the first stage, an annual simulation analysis is performed on a community of 41 households and a large solar PV plant, comparing a scenario without a CSC community, a scenario with a standard CSC community, and a scenario with a CSC community with flexibility from water storage heaters, which showed that an average benefit of 70euro/year per household can be achieved due to flexibility and an increase of 6% and 22% of solar PV community self-consumption and self-production respectively. In the second stage, we present the results of the real-world deployment in the community, analysing its technical performance and user reception, and examine the main factors shaping user engagement and satisfaction.
Secrecy Sum Rate Maximization for OIRS-Aided Visible Light Communications with Confidential Messages
This paper investigates the secrecy sum-rate (SSR) performance of optical intelligent reflecting surface (OIRS)-assisted multi-user visible light communication (VLC) systems under line-of-sight (LoS) blockages. To mitigate physical obstructions and internal eavesdropping, a joint optimization problem is formulated to maximize the SSR through the co-design of the transmission precoder and OIRS units assignment. Due to the binary constraints and coupled variables, the problem is highly non-convex. To solve it efficiently, an alternating optimization (AO) framework integrating the concave-convex procedure (CCCP) and first-order Taylor approximations is developed. Simulation results demonstrate the convergence of the proposed algorithm and show that increasing the number of OIRS reflecting units yields significant SSR gains.
Surrogate Modeling of Interconnector Flows: A Machine Learning Alternative to Full-Scale Power System Simulations with Application to Cross-Border Electricity Exchange
Cross-border electricity exchanges are crucial for operating and planning highly renewable power systems. Many studies reduce spatial granularity to keep the models tractable and prescribe cross-border exchanges exogenously, often by reusing historical import/export time series. Such assumptions become inconsistent as renewable penetration changes the magnitude and timing of flows. This paper proposes a machine-learning (ML) surrogate framework that maps available nodal time series data (e.g., hourly demand and renewable generation) to synthetic, interconnector-level flow time series. The goal is to provide consistent flow profiles that are used as fixed boundary conditions in reduced power system optimization models (PSOMs). To improve downstream feasibility when surrogate flows are imposed in optimization, we further introduce a custom loss for the neural-network surrogate that penalizes physically impossible flow patterns. We demonstrate the framework on a pan-European single-node per country DC optimal power flow setting using the open-source LEGO PSOM with ENTSO-E TYNDP 2024 National Trends assumptions for 2030. We assess two model classes: k-nearest neighbors (KNN) and feedforward neural networks (SQU), using both full and reduced feature sets. The SQU models generalize more robustly than KNN to unseen climate years and substantially improve upon scaled historical benchmarks in terms of predictive accuracy. When imposed as fixed boundary flows in single-node PSOMs, the ML-generated profiles produce outcomes that closely match the results of the full European simulation, while delivering substantial runtime reductions (up to ~500x). These results indicate that ML-based flow surrogates can provide decision-relevant interconnector flows for tractable reduced studies in high-renewable systems.
Navigating the unknown in large-scale operational transformation programs: The "Sirius Days" framework as a 'pilot-organization' for characterizing emerging issues
Large-scale digital transformation programs must simultaneously sustain existing operations and navigate deep unknowns emerging from IT-business-operations interactions -a challenge conventional project governance frameworks inadequately address. Based on a longitudinal case study of a transformation program, we investigate the ''Sirius Days,'' a monthly senior management retreat identified as a critical success factor. We show that this framework constitutes a pilot-organization: an organizational 'dispositif' (or apparatus) that deconstructs established knowledge or assumptions, formulates rigorous conjectures, and tests them in real conditions. It generated five resilience levers -systemic characterization of unknowns, early anomaly discernment, expansion of performance norms, social capital creation through a community of inquiry, and expansion of organizational agility across scales -revealing a model of an organizational 'dispositif' that operationalizes navigating unknowns across cognitive, social, and normative dimensions.
Validation-Gated Multi-Agent Governance for Online Adaptation of Thermal-Hydraulic Surrogate Models under Operating-Regime Shift
Artificial-intelligence surrogates can support second-by-second thermal-hydraulic forecasting, but models selected and frozen offline may become condition-locked once deployed outside their pretraining envelope. This study develops a guarded continual-adaptation framework for experimental thermal-hydraulic loop data in which role-separated agents - Monitor, Diagnosis, Adaptation, Safety-Auditor, and Orchestrator - diagnose error signatures, prioritize candidate model families, and review promotions, while deterministic champion-challenger gates and background shadow learning retain final authority over model replacement. Seven surrogate families were screened by blocked three-fold cross-validation, and a temporal Fourier neural operator was selected as the initial champion for 60-s-history-to-10-s-trajectory forecasting on two held-out transients, with three seeds per adaptive mode. Static deployment gave a channel-averaged MAE of 7.06 and a 56.8% warning-exceedance ratio; rule-based adaptation reduced MAE to 6.54, whereas shadow refresh alone remained close to Static. The MA-Full mode, in which the role-separated multi-agent council reviews every evaluated stream step, achieved the lowest mean error, 5.72, and 35.8% exceedance, corresponding to a 19.0% improvement over Static. Paired bootstrap intervals against Static excluded zero, although intervals among adaptive modes overlapped and the six paired units limit broad statistical claims. Validated promotions from the neural operator to Transformer and graph neural network indicate that logged, gate-controlled adaptation can support auditable surrogate evolution while deterministic gates retain deployment authority.
Equivalent Circuit Model based Electric Vehicle Evacuation with Mobile Charging Stations
The increasing penetration of electric vehicles (EVs) introduces new challenges for emergency evacuation planning due to limited driving range, long charging times, and constrained charging infrastructure, particularly under disaster induced disruptions. This paper proposes a novel optimization based evacuation framework for EVs using Equivalent Circuit Models (ECMs) to jointly address routing, charging, and congestion management. By leveraging electrical analogies, traffic flow is modeled as electrical current, travel time as resistance, and driving range as voltage, enabling the use of Kirchhoff laws to enforce flow balance and energy feasibility constraints. The proposed controllable ECM incorporates binary switches to regulate route selection and explicitly models charging delays and range replenishment at both Fixed Charging Stations (FCSs) and Mobile Charging Stations (MCSs). The resulting formulation leads to an integer programming problem that determines optimal evacuation routes, charging durations, and the placement and number of MCSs to minimize evacuation time. The framework is extended to multiple origin destination pairs using the principle of superposition and supports fairness aware performance metrics, including worst case, average, and variance based evacuation times. Simulation studies on large scale transportation networks in California demonstrate that the proposed approach significantly improves evacuation efficiency and robustness, particularly in scenarios with limited charging access, highlighting the critical role of MCSs in EV based emergency evacuations.
Dynamics of the Thermomagnetic Pendulum
A thermomagnetic pendulum is introduced as a coupled thermo-magnetic-mechanical system consisting of a ferromagnetic bob under gravity and an offset permanent magnet. Heating drives the bob temperature above and below the Curie point, causing magnetic attraction to vanish and recover as the bob moves and cools. A multiphysics model is developed in which the magnetic torque depends nonlinearly on the bob temperature field and pendulum configuration. The formulation couples transient three-dimensional heat transfer, a temperature-dependent magnetization law, and pendulum dynamics. Simulations show angular torque asymmetry, rapid force reduction near the Curie point, and sustained oscillations.
Learning Local Optimal Controller for a Class of Nonlinear Systems via Impulse-Supervised Exploration
This paper develops an impulse-supervised confined exploration framework for learning local optimal controller for a class of nonlinear systems. The proposed approach combines continuous-time approximate dynamic programming (ADP) with an impulsive supervisory layer, where impulsive braking confines the state within a prescribed region in which a local linear approximation of the nonlinear system is valid. This enables desired persistent excitation required for parameter convergence while preventing large state deviations that invalidate local optimality. The resulting hybrid closed-loop system enforces invariance of the exploration region through state-triggered braking inputs. Simulation results on a nonlinear mechanical system demonstrate effectiveness of the proposed approach.
Impedance Modeling and Stability Analysis of Droop-Controlled Inverter Under Unbalanced Power Grid Operating Conditions
With the growing integration of renewable energy sources into power grids, the risks of oscillation caused by interactions between grid-tied inverters and the grids are becoming increasingly prominent. Although existing studies have made significant progress in inverter modeling and oscillatory stability analysis, most of them do not sufficiently consider complex mirror frequency coupling effects (MFCE) under unbalanced operating conditions, leading to unreliable models and erroneous stability analysis results. To address this inadequacy, this work develops a novel sequence impedance modeling scheme that can be widely applied to unbalanced operating conditions. In particular, taking a representative type of grid-forming inverter for instance, i.e., droop-controlled inverter (DCI), a single-input single-output sequence impedance modeling method based on harmonic linearization (HL) is proposed to comprehensively model both a given DCI and the connected grid. By accounting for multi-frequency interactions within the DCI, this method captures MFCE and unbalanced factors, leading to a more accurate impedance model. Further, the dominant factors influencing system stability are identified with a combination of normalized sensitivity analysis and proportional weighting. Finally, the detailed impacts of these dominant factors on system stability margin under three typical unbalanced operating conditions are analyzed through the Bode criterion. The effectiveness and reliability of the whole scheme proposed in this work are validated on the constructed grid-connected droop-controlled experimental platform.
comment: 12 pages, accepted for publication in IEEE Transactions on Industrial Electronics
Observer-Based Control of Linear Systems with Mismatched Input and Output Delays
This paper investigates the stabilization of linear systems subject to simultaneous, mismatched time delays in both the control input and system output vectors. The proposed control framework is developed in two primary stages. First, an asymptotically stabilizing delayed state-feedback controller is synthesized by leveraging recent advancements in Linear Matrix Inequality (LMI) techniques. Second, this controller is realized using novel time-delay compensators \cite{trinhnam26}. This architecture successfully accommodates an output measurement delay $τ_y$ that is independent of the input delay $τ_u$, enabling direct estimation of the delayed state-feedback control law. The proposed methodology is then extended to target output controllers to account for simultaneous, mismatched time delays in both the control input and system output vectors.
comment: Preprint of a chapter intended for a forthcoming research monograph
RMPrior: Bridging Propagation Priors and Diffusion Refinement for Efficient Radio Map Construction
Diffusion models achieve high-fidelity radio map construction through iterative denoising, yet their sampling cost limits practicality in dynamic wireless systems where radio maps must be refreshed repeatedly. Meanwhile, classical propagation models encode valuable scene-level knowledge that standard diffusion inference discards entirely by initializing from pure Gaussian noise. This paper bridges propagation priors and diffusion refinement through a mid-start sampling strategy. A matched propagation prior is perturbed to an intermediate diffusion timestep, and the pretrained diffusion backbone executes only the remaining reverse steps, focusing computation on multipath-aware refinement rather than full reconstruction from noise. We provide theoretical analysis establishing an upper bound on the initialization gap, a sufficient condition under which truncation improves reconstruction fidelity, and a formal characterization of prior-quality sensitivity under aggressive truncation. Experiments on IRT4HighRes show that, at $P_{\text{start}}=0.5$, the proposed method achieves a $2.01\times$ speedup while simultaneously improving NMSE, RMSE, SSIM, and PSNR over the full-step baseline. A prior-quality ablation across three propagation models of different fidelity confirms that reconstruction quality tracks prior quality, with the sensitivity amplified under shorter reverse trajectories, consistent with the theoretical predictions. These results also suggest that mid-start reconstruction quality can serve as a proxy for ranking the scene-level fidelity of different propagation models.
Brief Announcement: Generative Markov Model for Distributed Computing Systems SC 2026
Emerging distributed computing paradigms, such as the computing continuum, are inherently heterogeneous, stochastic, and complex. Efficiently and effectively utilizing all available resources across the continuum demands a unified formal model of the system. To address this gap, we propose a general framework for modeling distributed computing systems as a generative Markov model, factorized over a structured system state. In our model, the state decomposes into high-dimensional variables, each further factorized over its elements, reflecting the sparse dependency structure inherent to distributed systems. This yields a tractable model enabling simulation, inference, and policy learning over otherwise intractable system states, bridging distributed computing with Markov chain theory and reinforcement learning (RL). We demonstrate our framework through a case study of collaborative AI inference, in which a dedicated server combines resources with those volunteered by service users. Our results show that centralized scheduling becomes a bottleneck at scale, while distributing computation across user devices reduces both latency and server resource consumption. These findings highlight the value of adaptive decision-making in distributed computing systems and demonstrate the framework's utility for modeling, simulation, and optimization.
comment: Submitted to 40th International Symposium on Distributed Computing (DISC 2026)
Rotatable Antenna Meets Multiple Access: NOMA or OMA?
Rotatable antenna (RA) technology has emerged as a promising solution to enhance spectrum efficiency by exploiting additional spatial degrees of freedom (DoFs) in multiple access networks. However, the relative performance superiority among different multiple access schemes remains largely unclear due to the unique capability of RA in reconfiguring the directional gain pattern. In this letter, we conduct a theoretical comparison between non-orthogonal multiple access (NOMA) and orthogonal multiple access (OMA) schemes in RA-assisted communication systems in terms of transmit power minimization, subject to constraints on antenna rotational range and users' target rates. To address the associated non-convex optimization problem, a particle swarm optimization (PSO) algorithm is employed to optimize the rotational angle. Simulation results demonstrate that RA-assisted schemes significantly reduce transmit power compared to fixed-antenna benchmarks. Furthermore, RA-assisted NOMA may perform worse than time-division multiple access (TDMA) for symmetric user deployments, while it exhibits superior robustness and energy efficiency in asymmetric scenarios.
Glass Box at Orbit: A Constitutional AI Verification Framework for Trustworthy Autonomous CubeSat Intelligence
The space industry is quietly building toward something nobody has fully reckoned with: orbital data centers running thousands of autonomous AI workloads with no human in the loop, 550 km above the Earth. Microsoft, AWS, and a growing list of orbital computing ventures are moving cloud-scale processing off the ground and into orbit. What none of them have answered yet is the governance question -- when autonomous AI systems at orbital data center scale make wrong decisions in space, what stops those decisions before they become irreversible? We introduce Glass Box: a runtime constitutional AI verification layer that intercepts every candidate action from an onboard AI policy and evaluates it against six physics-grounded constitutional constraints and seven Linear Temporal Logic (LTL) safety invariants before a single command reaches any spacecraft subsystem. Every approved action carries a weighted explainability score E(a_t) in [0,1] and a complete constitutional audit log. We demonstrate Glass Box within Project October: a fully simulated five-layer autonomous orbital intelligence architecture for CubeSat-class spacecraft. We prove that Glass Box verification overhead is O(N_c) in the number of constitutional rules, independent of model size or spacecraft state dimension. We present a complete formal specification of the constitutional constraint grammar, seven LTL safety invariants verified by Z3 and NuSMV model checking, and a detailed worked example of Glass Box intercepting an unsafe inference request at eclipse-entry under degraded battery state. As orbital computing scales toward data center infrastructure, runtime constitutional verification is no longer a research novelty -- it is mission-critical safety infrastructure that every autonomous orbital platform will eventually require.
comment: 12 pages, 2 figures, 2 tables, 32 references. Paper 1 of the Project October series on autonomous orbital intelligence
Certified Neural Approximations of Nonlinear Dynamics
Neural networks hold great potential to act as approximate models of nonlinear dynamical systems, with the resulting neural approximations enabling verification and control of such systems. However, in safety-critical contexts, the use of neural approximations requires formal bounds on their closeness to the underlying system. To address this fundamental challenge, we propose a novel, adaptive, and parallelizable verification method based on certified first-order models. Our approach provides formal error bounds on the neural approximations of dynamical systems, allowing them to be safely employed as surrogates by interpreting the error bound as bounded disturbances acting on the approximated dynamics. We demonstrate the effectiveness and scalability of our method on a range of established benchmarks from the literature, showing that it significantly outperforms the state of the art. Furthermore, we show that our framework can successfully address additional scenarios previously intractable for existing methods -- neural network compression and an autoencoder-based deep learning architecture for training Koopman operators for the purpose of trajectory prediction.
comment: first and second author contributed equally
Backstepping Control of First-Order Hyperbolic Equations in Arbitrary Dimensions with Non-Trapping Characteristics
This paper presents a backstepping approach for the boundary control of first-order hyperbolic equations with spatially varying coefficients posed on domains of arbitrary dimension. The method is based on a change of variables induced by the characteristic flow of the time-invariant transport operator, transforming the original multidimensional system into a continuum of decoupled one-dimensional hyperbolic equations evolving along individual characteristic curves. A backstepping controller is then designed for each equation in the decomposition, and the resulting control laws are reassembled in the original coordinates to achieve finite-time stabilization of the full system. The framework relies on the existence of characteristic curves foliating the spatial domain, with uniformly bounded transit times (non-trapping).
Learning to Adapt Control Barrier Functions Under Epistemic and Aleatoric Uncertainty
Control barrier functions (CBFs) provide a tractable mechanism for enforcing safety constraints in robotic systems, but their practical performance depends strongly on the choice of class-K function parameters. Under input constraints, conservative parameters often preserve feasibility at the cost of slow progress, whereas aggressive parameters can make the CBF-based optimization infeasible or unsafe. This paper proposes Online Adaptive CBF (OA-CBF), a framework for adapting CBF parameters at runtime. We introduce the notion of locally validated CBF parameters, which certify candidate parameters over a finite prediction horizon, and show that safety is preserved when such validation is maintained over successive update intervals. To identify locally validated parameters efficiently, OA-CBF trains a probabilistic ensemble neural network to evaluate queried CBF parameters rather than directly predict a single parameter. A graph-attention encoder represents variable-size obstacle environments, an epistemic uncertainty gate calibrated by conformal prediction rejects unreliable predictions, and a distributionally robust CVaR condition screens aleatoric risk. Among the verified candidates, OA-CBF selects the parameter with the best predicted progress metric and applies it through either an MPC-CBF or CBF-QP safety filter. Simulation studies on dynamic unicycle, planar and three-dimensional quadrotor, kinematic bicycle, and VTOL quadplane benchmarks show that OA-CBF reduces the conservatism of fixed-parameter CBF controllers while maintaining low collision and infeasibility rates.
comment: Extended journal version of the IEEE CDC 2025 paper (available as arXiv:2504.03038v5). Project page: https://www.taekyung.me/oa-cbf
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
Success Conditioning as Policy Improvement: The Optimization Problem Solved by Imitating Success
A widely used technique for improving policies is success conditioning, in which one collects trajectories, identifies those that achieve a desired outcome, and updates the policy to imitate the actions taken along successful trajectories. This principle appears under many names -- rejection sampling with SFT, goal-conditioned RL, Decision Transformers -- yet what optimization problem it solves, if any, has remained unclear. We prove that success conditioning exactly solves a trust-region optimization problem, maximizing policy improvement subject to a $χ^2$ divergence constraint whose radius is determined automatically by the data. This yields an identity: relative policy improvement, the magnitude of policy change, and a quantity we call action-influence -- measuring how random variation in action choices affects success rates -- are exactly equal at every state. Success conditioning thus emerges as a conservative improvement operator. Exact success conditioning cannot degrade performance or induce dangerous distribution shift, but when it fails, it does so observably, by hardly changing the policy at all. We apply our theory to the common practice of return thresholding, showing this can amplify improvement, but at the cost of potential misalignment with the true objective.
Interpolatory Approximations of PMU Data: Dimension Reduction and Pilot Selection
This work investigates the reduction of phasor measurement unit (PMU) data through low-rank matrix approximations. To reconstruct a PMU data matrix from fewer measurements, we propose the framework of interpolatory matrix decompositions (IDs). In contrast to methods relying on principal component analysis or singular value decomposition, IDs recover the complete data matrix using only a few of its rows (PMU datastreams) and/or a few of its columns (snapshots in time). This row-/column-based compression enables real-time monitoring of power transmission systems using measurements from a smaller subset of pilot datastreams, thereby minimizing communication bandwidth. The ID perspective gives a rigorous error bound on the quality of the data compression. We propose selecting the pilot measurements used in an ID via the discrete empirical interpolation method (DEIM), a greedy algorithm that aims to control the error bound. This bound yields a computable estimate of the reconstruction error during online operations. A violation of this estimate suggests a change in the system's operating conditions and thus serves as a tool for fault detection. Following a disturbance, DEIM can be used to localize the event source across all buses with high accuracy. Numerical tests on synthetic PMU data demonstrate DEIM's excellent performance in data compression and validate the proposed DEIM-based fault-detection and localization method.
$\mathcal{H}_2$-optimal model reduction of linear quadratic-output systems by multivariate rational interpolation
This paper addresses the $\mathcal{H}_2$-optimal approximation of linear dynamical systems with quadratic-output functions, also known as linear quadratic-output systems. Our major contributions are threefold. First, we derive interpolatory first-order optimality conditions for the linear quadratic-output $\mathcal{H}_2$ minimization problem. These conditions correspond to the mixed-multipoint tangential interpolation of the full-order linear- and quadratic-output transfer functions, and generalize the Meier-Luenberger optimality framework for the $\mathcal{H}_2$-optimal model reduction of linear time-invariant systems. Second, given the optimal interpolation data, we show how to enforce the interpolatory optimality conditions explicitly by Petrov-Galerkin projection of the full-order model. Third, to find the optimal interpolation data, we build on this projection framework and propose a generalization of the iterative rational Krylov algorithm for the $\mathcal{H}_2$-optimal model reduction of linear quadratic-output systems, called LQO-IRKA. Upon convergence, LQO-IRKA produces reduced linear quadratic-output systems that satisfy the interpolatory optimality conditions. The method only requires solving shifted linear systems and matrix-vector products, thus making it suitable for large-scale problems. Numerical examples are included to illustrate the effectiveness of the proposed method.
$H_2$ optimal model reduction of linear systems with multiple quadratic outputs
In this work, we consider the $H_2$ optimal model reduction of dynamical systems that are linear in the state equation and up to quadratic nonlinearity in the output equation. As our primary theoretical contributions, we derive gradients of the squared $H_2$ system error with respect to the reduced model quantities and, from the stationary points of these gradients, introduce Gramian-based first-order necessary conditions for the $H_2$ optimal approximation of a linear quadratic output (LQO) system. The resulting $H_2$ optimality framework neatly generalizes the analogous Gramian-based optimality framework for purely linear systems. Computationally, we show how to enforce the necessary optimality conditions using Petrov-Galerkin projection; the corresponding projection matrices are obtained from a pair of Sylvester equations. Based on this result, we propose an iteratively corrected algorithm for the $H_2$ model reduction of LQO systems, which we refer to as LQO-TSIA (linear quadratic output two-sided iteration algorithm). Numerical examples are included to illustrate the effectiveness of the proposed computational method against other existing approaches.
comment: 21 pages, 3 figures
Constrained Control of PDE Traffic Flow via Spatial Control Barrier Functions
In this paper, a constrained control approach to variable speed limit (VSL) control for macroscopic partial differential equations (PDE) traffic models is developed. Control Lyapunov function (CLF) theory for ordinary differential equations (ODE) is extended to account for spatially and temporally varying states and control inputs. The stabilizing CLF is then unified with safety constraints through the introduction of spatially varying control barrier functions (sCBF). These methods are applied to in-domain VSL control of the Lighthill-Whitham-Richards (LWR) model to regulate traffic density to a desired profile while ensuring the density remains below prescribed limits enforced by the sCBF. Results show that incorporating constrained control minimally affects the stabilizing control input while successfully maintaining the density with the defined safe set.
comment: Accepted to 2026 European Control Conference, 6 pages, 7 figures
Smooth Sampling-Based Model Predictive Control Using Deterministic Samples
Sampling-based model predictive control (MPC) is effective for nonlinear systems but often produces non-smooth control inputs due to random sampling. To address this issue, we extend the model predictive path integral (MPPI) framework with deterministic sampling and improvements from cross-entropy method (CEM)--MPC, such as iterative optimization, proposing deterministic sampling MPPI (dsMPPI). This combination leverages the exponential weighting of MPPI alongside the efficiency of deterministic samples. Experiments demonstrate that dsMPPI achieves smoother trajectories compared to state-of-the-art methods.
comment: To be published in the Proceedings of the 23rd IFAC World Congress (IFAC 2026)
DeMuon: A Decentralized Muon for Matrix Optimization over Graphs
In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations-a technique inherited from its centralized predecessor, Muon-and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate a clear margin of improvement of DeMuon over other popular decentralized algorithms across different network topologies.
comment: Add an accelerated variant of the proposed method. New proofs of proposed methods
Random Access for LEO Satellite Communication Systems via Deep Learning
Integrating contention-based random access procedures into low Earth orbit (LEO) satellite communication (SatCom) systems poses new challenges, including long propagation delays, large Doppler shifts, and a large number of simultaneous access attempts. These factors degrade the efficiency and responsiveness of conventional random access schemes, particularly in scenarios such as satellite-based internet of things and direct-to-device services. In this paper, we propose a deep learning-based random access framework designed for LEO SatCom systems. The framework incorporates an early preamble collision classifier that uses multi-antenna correlation features and a lightweight 1D convolutional neural network to estimate the number of collided users at the earliest stage. Based on this estimate, we introduce an opportunistic transmission scheme that balances access probability and resource efficiency to improve success rates and reduce delay. Simulation results under 3GPP-compliant LEO settings confirm that the proposed framework achieves higher access success probability, lower delay, better physical uplink shared channel utilization, and reduced computational complexity compared to existing schemes.
comment: 13 pages, 13 figures, 5 tables
Scheduling Analysis of UAV Flight Control Workloads on PREEMPT_RT Linux Using a Raspberry Pi 5
Modern UAV architectures increasingly aim to unify high-level autonomy and low-level flight control on a single General-Purpose Operating System (GPOS). However, complex multi-core System-on-Chips (SoCs) introduce significant timing indeterminism due to shared resource contention. This paper performs an architectural analysis of the PREEMPT RT Linux kernel on a Raspberry Pi 5, specifically isolating the impact of kernel activation paths (deferred execution SoftIRQs versus real-time direct activation) on a 250 Hz control loop. Results show that under heavy stress, the standard kernel is unsuitable, exhibiting worst-case latencies exceeding 9 ms. In contrast, PREEMPT RT reduced the worst-case latency by nearly 88 percent to under 225 microseconds, enforcing a direct wake-up path that mitigates OS noise. These findings demonstrate that while PREEMPT RT resolves scheduling variance, the residual jitter on modern SoCs is primarily driven by hardware memory contention.
comment: 9 pages, 8 figures, conference
Distributed Fusion Estimation with Protecting Exogenous Inputs
In the context of distributed fusion estimation, directly transmitting local estimates to the fusion center may cause a privacy leakage concerning exogenous inputs. Thus, it is crucial to protect exogenous inputs against full eavesdropping while achieving distributed fusion estimation. To address this issue, a noise injection strategy is provided by injecting mutually independent noises into the local estimates transmitted to the fusion center. To determine the covariance matrices of the injected noises, a constrained minimization problem is constructed by minimizing the sum of mean square errors of the local estimates while ensuring (ε, δ)-differential privacy. Suffering from the non-convexity of the minimization problem, an approach of relaxation is proposed, which efficiently solves the minimization problem without sacrificing differential privacy level. Then, a differentially private distributed fusion estimation algorithm based on the covariance intersection approach is developed. Further, by introducing a feedback mechanism, the fusion estimation accuracy is enhanced on the premise of the same (ε, δ)-differential privacy. Finally, an illustrative example is provided to demonstrate the effectiveness of the proposed algorithms, and the trade-off between differential privacy level and fusion estimation accuracy.
Distributed Circumnavigation Using Bearing Based Control with Limited Target Information
In this paper, we address the problem of circumnavigation of a stationary target by a heterogeneous group comprising of $\textbf{n}$ autonomous agents, having unicycle kinematics. The agents are assumed to have constant linear speeds, we control only the angular speeds. Assuming limited sensing capabilities of the agents, only a subset of agents, termed as \textit{leaders}, know the target location. The rest, termed as \textit{followers}, do not. We propose a distributed guidance law which drives all the agents towards the desired objective; global asymptotic stability (GAS) is ensured by using Zubov's theorem. The efficacy of the approach is demonstrated through both numerical simulations and hardware experiments.
comment: 6 pages, 17 figures
Data-Driven Stochastic Control: Foundations and Guarantees
This work establishes a step forward in advancing data-driven trajectory-based methods for stochastic systems with unknown mathematical dynamics. In contrast to scenario-based approaches that rely on independent and identically distributed (i.i.d.) trajectories, this work develops a data-driven framework where each trajectory is gathered over a finite horizon and exhibits temporal dependence, referred to as a non-i.i.d. trajectory. To ensure safety of dynamical systems using such trajectories, the current body of literature primarily considers dynamics subject to unknown-but-bounded disturbances, which facilitates robust analysis. While promising, such bounds may be violated in practice and the resulting worst-case robust analysis tends to be overly conservative. To overcome these key challenges, this paper considers stochastic systems with unknown mathematical dynamics, influenced by process noise with arbitrary distributions. In the proposed framework, data is collected from stochastic systems under multiple realizations within a finite-horizon experiment, where each realization generates a non-i.i.d. trajectory. Leveraging the concept of stochastic control barrier certificates constructed from data, this work quantifies probabilistic safety guarantees with a certified confidence level. To achieve this, the proposed conditions are formulated as a sum-of-squares (SOS) optimization problem, relying solely on empirical average of the collected trajectories and statistical features of the process noise. The efficacy of the approach has been validated on three stochastic benchmarks with unknown models and arbitrary noise distributions. In one case study, it is shown that while no safety controller exists for the robust analysis of the system under bounded disturbances, the proposed stochastic framework yields a safety controller together with quantified probabilistic safety guarantees.
Learning Power Flow with Confidence: A Probabilistic Guarantee Framework for Voltage Risk
The absence of formal performance guarantees in machine learning (ML) has limited its adoption for safety-critical power system applications, where confidence and interpretability are as vital as accuracy. In this work, we present a probabilistic guarantee for power flow learning and voltage risk estimation, derived through the framework of Gaussian Process (GP) regression. Specifically, we establish a bound on the expected estimation error that connects the GP's predictive variance to confidence in voltage risk estimates, ensuring statistical equivalence with Monte Carlo-based ACPF risk quantification. To enhance model learnability in the low-data regime, we first design the Vertex-Degree Kernel (VDK), a topology-aware additive kernel that decomposes voltage-load interactions into local neighborhoods for efficient large-scale learning. Building on this, we introduce a network-swipe active learning (AL) algorithm that adaptively samples informative operating points and provides a principled stopping criterion without requiring out-of-sample validation. Together, these developments mitigate the principal bottleneck of ML-based power flow, its lack of guaranteed reliability, by combining data efficiency with analytical assurance. Empirical evaluations across IEEE 118-, 500-, and 1354-bus systems confirm that the proposed VDK-GP achieves mean absolute voltage errors below 1E-03 p.u., reproduces Monte Carlo-level voltage risk estimates with 15x fewer ACPF computations, and achieves over 120x reduction in evaluation time while conservatively bounding violation probabilities.
comment: 10 pages
Secure RSMA-based Visible Light Networks under Spatial Correlation
This paper investigates the secrecy sum rate (SSR) of rate-splitting multiple access (RSMA)-based visible light communication (VLC) systems considering internal eavesdropping, where legitimate users may intercept private data intended for others. We formulate an optimization problem to maximize the SSR of the system, which is inherently non-convex due to the complex coupling of the objective function and constraints. To this end, two different approaches based on the convex-concave procedure (CCCP) and semidefinite relaxation (SDR) are leveraged to solve the non-convex parameterized problem. A central focus of this work is the investigation of channel similarity (CS), which serves as a metric for quantifying spatial correlation, and its impact on SSR performance. To mitigate the performance degradation caused by high spatial correlation, we propose a channel similarity reduction (CSR) clustering strategy that proactively minimizes CS to restore the system's degrees of freedom (DoF). Numerical results are provided to demonstrate the performance of the two proposed algorithms under various levels of CS. More importantly, the findings reveal that our proposed CSR-clustering strategy significantly outperforms existing baselines, effectively overcoming the secrecy performance ceiling caused by high spatial correlation.
BigDipper: Sharded Censorship Resistant Data Availability for Leader-Based BFT
Leader-based Byzantine-fault-tolerant (BFT) protocols provide low latency and simple communication structure, but they give the leader short-term control over transaction inclusion. A malicious leader can keep the protocol live while delaying or excluding time-sensitive transactions such as auction bids, oracle updates, liquidations, and bridge messages. Existing responses often build a fixed censorship-resistance, hiding, or ordering mechanism into the protocol path, forcing all transactions to pay for the same protection level. name follows the end-to-end principle: the consensus layer exposes inclusion primitives rather than hardcoding stronger policies. Higher-layer protocols can then choose their own submission strategies and resources, whether through replication, erasure coding, or other mechanisms, to obtain the censorship-resistance, hiding, ordering, or execution guarantees they need. At the core of BigDipper is censorship-resistant data availability, or DA-CR, which certifies available replica-contributed mini-blocks for use by leader-based consensus. A central design goal is that data remains sharded on the consensus critical path: validators do not reconstruct or execute the full payload before voting, but instead check commitments, availability evidence, and the DA-CR inclusion rule. We define DA-CR guarantees for data-tampering resistance, honest mini-block inclusion, and residual leader influence. We then give concrete constructions based on erasure coding and linear commitments, analyze client-tunable transaction submission, and instantiate BigDipper inside HotStuff-2.
R2DN: Scalable Parameterization of Contracting and Lipschitz Recurrent Deep Networks
This paper presents the Robust Recurrent Deep Network (R2DN), a scalable parameterization of robust recurrent neural networks for machine learning and data-driven control. We construct R2DNs as the feedback interconnection of a linear time-invariant system and a 1-Lipschitz deep feedforward network, and directly parameterize the weights so that our models are stable (contracting) and robust to small input perturbations (Lipschitz) by design. Our parameterization uses a structure similar to the previously-proposed recurrent equilibrium network (REN), but without the requirement to iteratively solve an equilibrium layer at each time-step. This speeds up both model inference and backpropagation on GPUs, and makes it computationally feasible to scale up the network size, batch size, and input sequence length in comparison to RENs. We compare R2DNs to RENs on three representative problems in nonlinear system identification, observer design, and learning-based feedback control. We find that training and inference are both up to an order of magnitude faster with similar test set performance, and that they scale more favorably with respect to model expressivity.
Robotics
TIDES: Time-Derivative Event Simulation via Deformable Reconstruction
Event cameras emit asynchronous events in response to environmental appearance changes. The scarcity of real-world event datasets makes simulation essential. However, most simulators infer event timestamps from frame sequences, forcing many threshold crossings to share a small set of discrete times; a failure mode we term timestamp batching that worsens under fast motion and occlusion. We present TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. Because TIDES operates on an explicit 3D scene representation with learnt geometry and motion, it can derive per-pixel intensity dynamics directly from the scene, rather than by differencing rendered frames. This enables accurate threshold-crossing prediction, including multiple crossings per rendering step, without temporal upsampling or frame interpolation. The same 3D scene model reveals where objects partially occlude one another; TIDES uses this to guide adaptive time stepping, concentrating computation only in regions where occlusion dynamics make simple models of brightness change unreliable. Finally, we model finite sensor bandwidth using a tile-level arbiter whose throughput, jitter, and event drops reproduce realistic sensor artifacts. Across paired RGB-event benchmarks, TIDES attains state-of-the-art event-stream fidelity. We also show that events simulated by TIDES transfer more effectively to real downstream tasks than competitors'.
World-Task Factorization for Robot Learning
Robot learning must produce policies that generalize to new combinations of constraints, teammates, and environments. To achieve this, we must structurally factor the policy, which is a choice that dictates what generalizes, what requires retraining, and what remains entangled. Existing methods span a wide spectrum, from expecting structure to emerge from data scaling, to hand-designing it via hierarchies, skill libraries or learned specializations. In this paper, we study what we argue is the most fundamental factorization in robotics: separating the world from the task. We investigate the conditions under which this factorization is principled. World factors are properties of the embodied system and the environment; they exist independently of intent. Task factors are defined by the task's logic over what the world admits. We formalize this asymmetry through Bayesian model evidence: it aligns with the data-generating process, maintains high likelihood through an analytical world model, and reduces the Occam razor's penalty on task parameters. We instantiate this factorization by pairing AICON, a differentiable graph of recursive estimators and interconnections that is compositional, operates without task-specific data, and propagates cost gradients to actuators, with a compact, learned policy that modulates gradient paths. Gradients serve as the interface between the two factors: they carry world structure through the graph and task structure through costs, enabling low-dimensional learning while preserving structural generalization. We test the world/task factorization across three problems that encompass heterogeneous robots, environments, task logic and sensorimotor modalities. Our framework outperforms end-to-end baselines and analytical heuristics in all settings, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.
Market-Based Replanning for Safety-Critical UAV Swarms in Search and Rescue Missions
Reliable autonomous UAV swarms in Search and Rescue (SAR) missions require fault-tolerant coordination capable of sustaining operations despite agent degradation. This paper introduces the Intelligent Replanning Drone Swarm (IRDS), a distributed coordination architecture designed for resource-constrained environments. The proposed framework employs a Reverse-Auction market mechanism where agents bid to service search sectors based on a distance-weighted cost function, coupled with a geometric consensus protocol for target verification. We evaluate the approach through physics-based simulations (N=8 agents, 8x8 grid) subjected to stochastic fault injection. Results indicate that the swarm autonomously reallocates tasks from failed agents with low latency relative to the total mission duration, maintaining a mission success rate of 93% under 25% workforce degradation. The proposed framework demonstrates a robust, empirically tested method for self-healing aerial robotic coordination.
comment: 6 pages, 4 figures, accepted at MIPRO 2026
WALL-WM: Carving World Action Modeling at the Event Joints
WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
Co-training with Ego-centric Video and Demonstration for Robot Navigation Task
Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has explored augmenting manipulation datasets with egocentric human videos, applying such approaches to mobile robot navigation remains challenging due to viewpoint changes during locomotion. In this paper, we propose a framework that converts egocentric walking videos into datasets for mobile robot imitation learning. The proposed method estimates camera motion from human videos and transforms it into action representations compatible with ground mobile robots. By jointly training a VLA model on human-derived and robot-collected datasets, the model achieves improved language understanding and more robust action generation than training with either data source alone. Experiments on a fruit-search navigation task demonstrate that human egocentric videos provide an effective and scalable data source for mobile robot learning.
Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects
World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene by object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions by a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.
Closed-Form Pose Estimation of Endoluminal Medical Devices via Gradiometer-Based Electromagnetic Localization System
Embedded magnetic tracking holds highly attractive prospects for remote navigation of endoluminal medical devices. However, existing six-degree-of-freedom pose recovery approaches often require pre-calibrated workspace field maps or iterative nonlinear optimization. This letter presents a Gradiometer-Based Electromagnetic Localization System (GELS), a closed-form tracking framework that uses a compact magnetometer array as an embedded quasi-gradiometer to estimate local magnetic fields and gradient tensors. These quantities are mapped by the Euler homogeneous relation to displacements between source and array, from which multi-source Procrustes registration recovers the array orientation and position using at least three non-collinear sources. The algorithm requires known source positions and array geometry, but no pre-calibrated workspace field maps, initial pose guesses, or calibrated excitation-source moments. The recovered pose also enables a proof-of-concept sub-level dipole localization task by serving as a mobile magnetic reference frame. Benchtop experiments across sensor-array configurations and excitation modes demonstrate sequence-averaged position errors of \SI{10.80}{\milli\meter}--\SI{15.57}{\milli\meter}, a fastest update rate of \SI{14.49}{\hertz}, and a median solver runtime of \SI{172.00}{\micro\second}. A perturbation-based error propagation analysis further identifies inter-sensor inconsistency and dipole-model mismatch as the dominant accuracy limits, thereby informing future sensor array and magnetic source design for further reducing pose-estimation error.
Set-Supervised Diffusion Policy: Learning Action-Chunking Diffusion through Corrections
Diffusion policies have recently emerged as a powerful framework for robotic manipulation. However, like other behavior cloning methods, they remain vulnerable to distributional shift, often requiring human-in-the-loop interventions to correct failures during deployment. These interactions naturally provide paired supervision in the form of the robot's undesired actions and the human teacher's corrective actions. Yet existing data aggregation pipelines and standard behavior cloning losses largely ignore this negative signal from undesired actions, leading to overfitting to teacher's actions and an increasing reliance on costly expert data. To address this limitation, we propose Set-Supervised Diffusion Policy (SDP), a novel learning framework that utilizes contrastive action-chunk data to train diffusion policies from human corrections. From paired positive and negative action-chunks, SDP constructs a set of desired action-chunks and designs a training pipeline that encourages the diffusion policy to align with the set. Through extensive experiments across multiple robotic manipulation tasks, we demonstrate that SDP consistently improves policy performance, with particularly strong gains in robustness to noisy data. Moreover, SDP induces high-quality aggregated datasets, enabling more efficient and reliable policy learning from human-in-the-loop corrections. Our code is available at https://set-supervised-diffusion-policy.github.io/.
PHASOR: Phase-Anchored Universal Action Representations for Humanoid Embodiments
Learning a good action embedding space is fundamental to scalable robot policy learning, yet existing methods treat action latents as task-specific intermediates rather than first-class representations. The resulting latents are unstructured, embodiment-specific, and weakly tied to motion semantics, limiting interpretability, controllability, and transferability across robots. We position the action embedding space itself as a first-class design target, with downstream policy quality emerging from representation quality. Exploiting motion's intrinsic periodicity, we factorize it into a phase manifold that captures cyclic structure via FFT-parametric coefficients, together with a pose branch that conditions the manifold on non-periodic configuration detail. Combined with motion-semantic distillation, this factorized structure yields a cross-embodiment motion manifold that is interpretable and embodiment-agnostic by design. Anchoring multiple humanoid robots to a shared human-pretrained manifold then produces a unified action embedding space across diverse platforms, achieving strong cross-embodiment retrieval and consistent gains on downstream robot tasks.
The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space ICML 2026
Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce $\textbf{Lie Diffuser Actor (LDA)}$, a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC$\rightarrow$D, LDA improves average task length from $3.27$ to $3.51$ ($+7.3\%$). We further validate our method on real robot and the results show that our methodology outperforms the baseline on majority tasks.
comment: ICML 2026 Accepted
DisFlow: Scene Flow from Distance Field for Object Pose, Velocity Tracking, and Dynamic Object Reconstruction
We present \emph{DisFlow}, a novel framework for online scene flow estimation from distance field that enables \emph{6DoF dynamic object pose estimation}, \emph{motion tracking}, and \emph{surface reconstruction}. The scene is represented by Gaussian Process Implicit Surfaces (GPIS), with surface normals serving as derivative constraints, enabling accurate signed distance computations near the surface and gradient queries with uncertainty. With this representation as a foundation, we compute a scene flow from the distance field that describes how surface points are transported over time in consecutive frames. Through our flow, we can estimate an object's pose and motion by incrementally registering a new observed point cloud via an elegant closed-form optimisation. Unlike prior methods that operate in the camera or world frame, our approach performs probabilistic fusion directly in the \emph{object frame}, where the object remains geometrically consistent over time. The tight coupling of the DisFlow method in space and time yields dense geometry, surface normals, object pose trajectories, velocities, and uncertainty, all at real-time rates. We evaluate DisFlow on dynamic object sequences and demonstrate that it achieves accurate pose and motion tracking while simultaneously reconstructing high-quality object surfaces. Code publicly available at \href{https://github.com/LanWu076/disflow_ros2}{https://github.com/LanWu076/disflow\_ros2}
Trans2Occ: Voxel Occupancy Estimation and Grasp for Transparent Objects from Simulation to Reality
Transparent objects remain challenging for robotic perception due to unreliable depth sensing caused by refraction and reflection. While prior approaches rely on multi-view reconstruction or depth completion, they are often difficult to scale or deploy in real-world robotic systems. In this paper, we present a practical framework for transparent object perception and manipulation based on single-view RGB input. Our approach predicts voxel-space occupancy directly from a single image, providing a geometry-aware representation that supports downstream robotic grasping. To enable large-scale training, we construct a simulation pipeline that generates paired RGB images and voxel occupancy annotations under diverse materials and lighting conditions. We demonstrate that the predicted occupancy representation is robust to domain shifts and transfers effectively from simulation to real-world robotic setups without fine-tuning. A simple rule-based grasping strategy built on top of the occupancy further achieves reliable grasp performance on transparent objects. Extensive experiments in both simulation and real-world environments show that our framework provides accurate 3D understanding and enables practical manipulation of transparent objects. These results suggest that single-view occupancy prediction offers a scalable and effective solution for transparent object perception in robotics.
FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds
This paper proposes ``FlatVPR,'' a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors $\mathbf{z}_A$ and $\mathbf{z}_B$ can be accurately reconstructed via linear interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$, where $t \in [0,1]$ denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$ to the raw foundation features $\mathbf{z}$, where $\text{Res}(\cdot)$ represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100m intervals and extreme seasonal changes.
comment: 5 pages, 1 figure, technical report
FlipItRight: Stable Pose-Targeted Throw-Flip Across Diverse Objects
We propose FlipItRight, a framework for stable planar pose-targeted throw-flip with a high-DoF manipulator. The task is decomposed into an object-level planner, which generates candidate release states satisfying the desired landing pose, and a robot-level planner, which evaluates executability and constructs a feasible swing motion. Treating the release state as an explicit intermediate representation enables principled candidate filtering, adaptive selection of release and pre-swing configurations, and structured near-release motion design -- in particular, approximately constant end-effector velocities during the final swing phase to improve robustness to release-timing uncertainty. We validate on a real platform across objects of varying shape, size, and mass, achieving a 90% success rate across 120 trials. Ablation studies confirm that each design choice contributes to throwing performance, and the framework requires no prior data or learned model, enabling direct deployment on new objects and targets without environment-specific calibration or data collection.
Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation
Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a visible navigable pixel to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE.Project Page: https://baobao0926.github.io/Goal2Pixel/.
comment: 8 pages
Embedding Semantic Risk into Distance Fields and CBFs for Online Monocular Safe Control
We propose an online monocular perception-to-control framework that embeds semantic risk into the distance field used by Control Barrier Function (CBF)-based safe navigation and teleoperation. Many perception-based safety filters assign the same distance-based safety margin to all mapped obstacles or use semantics only as a downstream controller adjustment, rather than encoding semantic risk in the spatial representation. Our framework instead reasons online about obstacle geometry and class-dependent risk by embedding semantic information directly into the Euclidean Signed Distance Field (ESDF). This design encodes semantic risk before control optimization, so high-risk objects exert a larger spatial influence in the safety field while retaining efficient ESDF queries at runtime. Specifically, a foundation-model-based SLAM front end reconstructs dense 3-D geometry from monocular RGB video, while per-frame semantic segmentation provides pixel-level class labels that are fused into the reconstructed geometry. The resulting geometric-semantic representation is then converted into an ESDF, where semantic labels identify safety-relevant regions and impose class-dependent inflation before field computation. The semantic-aware ESDF provides the local distance values and spatial derivatives required by the CBF controller, while class-dependent gains further regulate the controller response. Extensive simulation and hardware experiments demonstrate online operation at 10--20 Hz and semantic-aware safe behavior in both teleoperation and autonomous navigation.
RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation
Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs and a six-dimensional evaluation protocol with 13 fine-grained criteria. Evaluating seven representative video world models with human and MLLM assessment, we find that current models often generate visually coherent videos, but struggle with constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression. These results show that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.
comment: Project: https://huiqiongli.github.io/RoboTrustBench/
Physics-Informed Modeling and Control of Emergent Behaviors in Robot Swarms
Robot swarms can exhibit coherent collective behaviors through local perception, limited communication and decentralized decision-making, yet modeling and controlling such emergence remains challenging when behaviors unfold over multiple phases. Here we introduce PhySwarm, a physics-informed micro--macro framework that represents multi-stage swarm emergence as physically constrained density-field evolution coupled to executable robot motion. At the macroscopic level, a multi-phase advection--diffusion--reaction model (Macro-ADR) describes phase-dependent swarm-density evolution through directed transport, diffusion-based spatial regulation and behavioral phase transitions. At the microscopic level, an equivalent deterministic motion model (Micro-EDM) realizes these mechanisms through potential-field advection, density-gradient compensation and rate- or event-gated phase switching. A neural-physics controller (NPC) maps local observations and temporal memory to bounded physical parameters, and is trained with a reinforcement learning--PINN objective that combines task rewards with macro-scale density residuals and micro-scale motion-consistency constraints. In several proof-of-concept swarm missions -- including trail-guided foraging, formation-reconfigurable navigation and role-adaptive search and rescue -- we demonstrate that PhySwarm can generate distinct multi-stage emergent behaviors within a unified physics-informed modeling framework. The learned density fields and physical parameters provide interpretable evidence of how advection, diffusion and reaction jointly regulate multi-stage swarm organization. These results establish a physics-informed route for learning, interpreting and controlling emergent behaviors in robot swarms.
Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation NeurIPS 2025
Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the \textbf{Hierarchical Semantic-Augmented Navigation (HSAN)} framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.
comment: Published in NeurIPS 2025, address some typos
Hierarchical Object Representation for Spatial Robot Perception: Points, Meshes, and Superquadrics
Hierarchical 3D Scene Graphs (3DSG) have emerged as an actionable and scalable representation for long-term autonomy incorporating metric, semantic, and topological information in the scene. However, the question of geometric representation of objects in 3DSG has been overlooked as most methods use simplified geometric models such as partial point clouds or 3D bounding boxes. In this work, we introduce a hierarchical object representation that can be leveraged for high-fidelity object-level reconstruction, object-based robust re-localization or map alignment, and efficient and analytical collision checking for safe robot navigation planning in dense and cluttered environments. The representation is structurally organized into four distinct layers, progressively abstracting the scene from raw sensor data to dense 3D meshes to analytical primitives such as superquadrics, which provide a sparse and analytical representation for object geometry. We develop a pipeline that builds the hierarchical object representation from RGB-D image stream captured by a robot, and demonstrate its working in real-world open-set object scenes in both indoor and outdoor environments. Extensive experiments across diverse datasets including HOPE, ReplicaCAD, Kimera-Multi, and NUS Campus Dataset collected using Unitree B2 Robot validate our pipeline in both indoor and outdoor environments. We show that our superquadric-based map alignment method outperforms the current state-of-the-art object based map alignment method ROMAN. Our code can be found at https://github.com/perceptica-robotics/Hickory.
comment: 18 pages, 5 figures, 4 tables
Spatio-Temporal Reconnection for Multi-Robot Networks using Adaptive Prescribed-Time CBFs
In multi-robot systems, maintaining persistent communication graph connectivity is often overly restrictive, especially when robots have limited communication ranges but operate in large environments. Instead, allowing robots to temporarily disconnect and later reconnect is often more desirable for efficient task execution while still ensuring timely information sharing across the team. In this paper, we propose an adaptive prescribed-time control barrier function (adaptive PT-CBF) framework that enables robots to temporarily disconnect and re-enter the communication range within an adjustable and feasible prescribed time. Moreover, we introduce a reconnection triggering mechanism that jointly considers task execution and reconnection urgency, thereby providing a principled way to decide when reconnection should occur. Theoretical analysis justifies convergence to the satisfying reconnection within a prescribed finite time. Experimental results validate the performance of our proposed adaptive PT-CBF with improved task efficiency and satisfying reconnections.
comment: 6 pages, 6 figures, accepted by IFAC 2026
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.
SPARC: Spatial-Aware Path Planning via Attentive Agent Communication
Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat all neighboring robots equally regardless of their spatial proximity, leading to diluted attention in congested regions where coordination matters most. We propose Relation enhanced Multi Head Attention (RMHA), a communication mechanism that explicitly embeds pairwise Manhattan distances into the attention weight computation, enabling each robot to dynamically prioritize messages from spatially relevant neighbors. Combined with a distance-constrained attention mask and GRU gated message fusion, RMHA integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40x40 grids, RMHA achieves approximately 75 percent success rate at 30 percent obstacle density outperforming the best baseline by over 25 percentage points. Ablation studies confirm that distance-relation encoding is the key contributor to success rate improvement in high-density environments. Index Terms-Multi-robot path planning, graph attention mechanism, multi-head attention, communication optimization, cooperative decision-making
comment: The manuscript is being withdrawn at the request of the first author for the purpose of revising content and re-uploading a revised version with updated data/figures/text . The revised manuscript will be resubmitted to arXiv promptly with the same author list and research theme
Magnetic Indoor Localization through CNN Regression and Rotation Invariance
Indoor positioning is an essential technology for a wide range of applications in GNSS-denied environments, including indoor navigation and IoT systems. Combining convolutional neural networks (CNNs) and magnetic field-based features offers a low-cost, infrastructure-free solution for precise positioning. While magnetic fingerprints are a promising approach for indoor positioning, models trained on raw 3D magnetometer data are highly sensitive to device orientation. We address this by using two rotation invariant features derived from the 3D magnetic field: the norm (Mn) and the projection onto the gravity axis (Mg). We train a lightweight 7-layer dilated CNN (MagNetS/XL) on magnetic sequences to directly regress (x, y) positions. Using the MagPie dataset (three buildings, handheld trajectories), we systematically evaluate fixed and random rotations of test and/or train data. Raw 3D inputs (Mx, My , Mz) exhibit isotropic error increases under fixed 90° rotations and further degrade with growing random rotations. In contrast, 2D (Mn, Mg) inputs maintain rotation invariant accuracy and surpass the 3D inputs once rotation exceeds building-specific thresholds for three reference buildings: 0° for Loomis (large), 5° for Talbot (medium), and 6° for CSL (small). MagNetXL achieves or exceeds state-of-the-art accuracy on the MagPie dataset, and MagNetS delivers similar performance with roughly one third of the parameters, favoring mobile deployment. These results show that the robustness gained from rotation invariant inputs outweighs the loss of input dimensionality in realistic usage, allowing mapping and localization without orientation alignment or added infrastructure.
comment: Published and presented at the 2026 4th International Conference on Mechatronics, Control and Robotics (ICMCR)
Self-Imitated Diffusion Policy for Efficient and Robust Visual Navigation
Diffusion policies (DP) have demonstrated significant potential in visual navigation by capturing diverse multi-modal trajectory distributions. However, standard imitation learning (IL), which most DP methods rely on for training, often inherits sub-optimality and redundancy from expert demonstrations, thereby necessitating a computationally intensive "generate-then-filter" pipeline that relies on auxiliary selectors during inference. To address these challenges, we propose Self-Imitated Diffusion Policy (SIDP), a novel framework that learns improved planning by selectively imitating a set of trajectories sampled from itself. Specifically, SIDP introduces a reward-guided self-imitation mechanism that encourages the policy to consistently produce high-quality trajectories efficiently, rather than outputs of inconsistent quality, thereby reducing reliance on extensive sampling and post-filtering. During training, we employ a reward-driven curriculum learning paradigm to mitigate inefficient data utility, and goal-agnostic exploration for trajectory augmentation to improve planning robustness. Extensive evaluations on a comprehensive simulation benchmark show that SIDP significantly outperforms previous methods, with real-world experiments confirming its effectiveness across multiple robotic platforms. On Jetson Orin Nano, SIDP delivers a 2.5$\times$ faster inference than the baseline NavDP, i.e., 110ms VS 273ms, enabling efficient real-time deployment.
comment: Preprint
Motion-aware Event Suppression for Event Cameras
In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by IMOs and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67\% in segmentation accuracy while operating at a 53\% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83\% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13\%.
comment: Robotics: Science and Systems (RSS) 2026
AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation SIGGRAPH 2026
Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, ARCTIC, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior arts frequently collapse. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications. Project page: https://agile-hoi.github.io.
comment: 16 pages, SIGGRAPH 2026
CART: Context-Aware Terrain Adaptation using Temporal Sequence Selection for Legged Robots
Animals in nature combine multiple modalities, such as sight and feel, to perceive terrain and develop an understanding of how to walk on uneven terrain in an efficient manner. Similarly, legged robots need to develop their ability to stably walk on complex terrains by developing an understanding of the relationship between vision and proprioception. Most current terrain-adaptation methods remain susceptible to failure on complex off-road terrain because they do not explicitly model the context between exteroceptive terrain appearance and proprioceptive physical interaction. This experience-based learning often creates a Visual-Texture Paradox between what has been seen and how it actually feels. In this work, we introduce CART, a high-level controller built on a context-aware terrain adaptation approach that integrates proprioception and exteroception from onboard sensing to achieve a robust understanding of terrain. We evaluate our method on multiple terrains using the Unitree Go2 and ANYmal-C robot on the IsaacSim simulator and a Boston Dynamics SPOT robot for our real-world experiments. To evaluate whether the learned context improves locomotion behavior under the various paradox circumstances, we measure the robot s stability, traversal success, and task completion time in both simulation and real-world experiments. We compare CART against state-of-the-art locomotion and terrain- adaptation baselines across diverse terrain conditions. CART improves the average success rate by 5% over the baselines in simulation, while improving context-conditioned locomotion behavior, including up to 41% lower base oscillation in simulation and 22% in the real world, without increasing the time required to complete the locomotion tasks.
URDF-Anything+: End-to-End Generation for Simulation-Ready Articulated Assets
Articulated objects are fundamental for robotics, simulation of physics, and interactive virtual environments. However, recovering them from visual observations is inherently challenging, as images provide only partial and ambiguous cues about both part geometry and their underlying kinematic structure. Existing approaches typically rely on multi-stage pipelines, retrieval from asset libraries, or explicit part segmentation. We present URDF-Anything+, an end-to-end autoregressive diffusion framework that generates simulation-ready URDF models directly from a single RGB image. Conditioned on visual observations and object geometry, URDF-Anything+ operates in a structured latent space and jointly models part geometry and articulation in a unified generation process. Specifically, the model sequentially predicts each articulated part together with its associated joint parameters, while a termination token dynamically determines the number of parts. This design enables direct generation of fully executable URDFs without external retrieval or post-processing stages. Experiments on large-scale articulated object benchmarks demonstrate that URDF-Anything+ outperforms prior methods in geometric reconstruction quality, joint parameter estimation, and physical executability, while being substantially more efficient than existing multi-stage approaches. Furthermore, the generated URDFs serve as faithful digital twins, enabling the zero-shot transfer of manipulation policies trained purely in simulation.
LAP: Fast LAtent Diffusion Planner for Autonomous Driving
Diffusion models have demonstrated strong capabilities for modeling human-like driving behaviors in autonomous driving, but their iterative sampling process induces substantial latency, and operating directly on raw trajectory points forces the model to spend capacity on low-level kinematics, rather than high-level multi-modal semantics. To address these limitations, we propose LAtent Planner (LAP), a framework that plans in a VAE-learned latent space that disentangles high-level intents from low-level kinematics, enabling our planner to capture rich, multi-modal driving strategies. To bridge the representational gap between the high-level semantic planning space and the vectorized scene context, we introduce an intermediate feature alignment mechanism that facilitates robust information fusion. Notably, LAP can produce high-quality plans in one single denoising step, substantially reducing computational overhead. Through extensive evaluations on the large-scale nuPlan benchmark, LAP achieves state-of-the-art closed-loop performance among learning-based planning methods, while demonstrating an inference speed-up of at most 10x over previous SOTA approaches.
Wall-OSS-0.5 Technical Report
Large-scale Vision-Language-Action (VLA) pretraining is increasingly adopted as the foundation for robot policies, yet the evidence for pretrained VLAs is almost invariably reported after task-specific fine-tuning. This leaves a foundational question unanswered: does VLA pretraining itself yield executable robot behavior, or does it merely furnish a better initialization for downstream policy learning? We present Wall-OSS-0.5, an open-source 4B VLA built upon a 3B VLM backbone augmented with action-generation components, designed so that pretrained robotic capability is directly measurable on physical hardware. The model is pretrained across more than 20 embodiments, processing over one million robot trajectories per epoch alongside a grounded multimodal corpus. We adopt a gradient-bridged co-training recipe in which three objectives play distinct and complementary roles: discrete action prediction routes strong VLM-native gradients into the backbone, multimodal prediction preserves grounded vision-language understanding, and continuous flow matching serves as the deployment-time action interface. Before task-specific fine-tuning, the pretrained checkpoint achieves non-trivial zero-shot real-robot behavior, completing several tasks, including a held-out deformable manipulation task, at high task progress on a 17-task suite. After fine-tuning, the same checkpoint serves as a stronger adaptation prior, reaching 60.5% average task progress on 15 real-robot tasks and outperforming π_0.5 by 17.5%. Multimodal evaluations further confirm that action training does not erode grounded vision-language competence: the model preserves broad vision-language ability while strengthening embodied grounding. Together, these results reposition VLA pretraining from an initialization strategy to a directly testable, already useful source of robot capability.
Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling
In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.
comment: 15 pages, accepted to Robotics: Science and Systems (RSS) 2026
BlueME: Robust Underwater Robot-to-Robot Communication Using Compact Magnetoelectric Antennas
We present the design, development, and experimental validation of BlueME, a compact magnetoelectric (ME) antenna array system for underwater robot-to-robot communication. BlueME employs ME antennas operating at their natural mechanical resonance frequency to efficiently transmit and receive very-low-frequency (VLF) electromagnetic signals underwater. We outline the design, simulation, fabrication, and integration of the proposed system on low-power embedded platforms, focusing on portable and scalable applications. For performance evaluation, we deployed BlueME on an autonomous surface vehicle (ASV) and a remotely operated vehicle (ROV) in open-water field trials. Ocean trials demonstrate that BlueME maintains reliable signal transmission at distances beyond 700 meters while consuming only 10 watts of power. Field trials show that the system operates effectively in challenging underwater conditions such as turbidity, obstacles, and multipath interference -- conditions that generally affect acoustics and optics. Our analysis also examines the impact of complete submersion on system performance and identifies key deployment considerations. This work represents the first practical underwater deployment of ME antennas outside the laboratory and implements the largest VLF ME array system to date. BlueME demonstrates significant potential for marine robotics and automation in multi-robot cooperative systems and remote sensor networks.
An Asynchronous Two-Speed Kalman Filter for Real-Time UUV Cooperative Navigation Under Acoustic Delays
In Global Navigation Satellite System (GNSS)-denied underwater environments, individual unmanned underwater vehicles (UUVs) suffer from unbounded dead-reckoning drift, making collaborative navigation (CN) crucial for accurate state estimation. However, the severe communication delay inherent in underwater acoustic channels poses serious challenges to real-time state estimation. Traditional filters, such as Extended Kalman Filters (EKFs) or Unscented Kalman Filters (UKFs), usually block the main control loop while waiting for delayed data, or effectively discard Out-of-Sequence Measurements (OOSMs), resulting in serious drift. To address this, we propose an Asynchronous Two-Speed Kalman Filter (TSKF) enhanced by a novel projection mechanism, which we term Variational History Distillation (VHD). The proposed architecture decouples the estimation process into two parallel threads: a fast-rate thread that utilizes Gaussian Process (GP) compensated dead reckoning to guarantee high-frequency real-time control, and a slow-rate thread dedicated to processing asynchronously delayed collaborative information. By introducing a Finite-Length Circular State Buffer (FLCSB), the algorithm applies delayed measurements to their corresponding historical states, and utilizes a VHD-based projection to fast-forward the correction to the current time without computationally heavy recalculations. Simulation results demonstrate that the proposed TSKF maintains a trajectory error comparable to computationally intensive batch-optimization methods under severe delays (up to 30\,s). Executing in sub-millisecond time, it significantly outperforms standard EKF/UKF. The results demonstrate an effective control, communication, and computing (3C) co-design that significantly enhances the resilience of autonomous marine automation systems.
comment: 6 pages, 6 figures. Accepted for publication in the 2026 IEEE International Conference on Industrial Informatics (INDIN). \c{opyright} 2026 IEEE. Personal use of this material is permitted. See PDF for the full IEEE copyright notice
Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning
Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in openended, evolving environments. However, conventional wisdom from continual learning suggests that naive Sequential Fine-Tuning (Seq. FT) leads to catastrophic forgetting, necessitating complex CRL strategies. In this work, we take a step back and conduct a systematic study of CRL for large pretrained VLAs across diverse lifelong RL benchmarks. We find that, contrary to established belief, simple Seq. FT with low-rank adaptation (LoRA) is remarkably strong: it achieves high plasticity, exhibits little to no forgetting, and retains strong zero-shot generalization, frequently outperforming more sophisticated CRL methods. Through detailed analysis, we show that this robustness arises from a synergy between the large pretrained model, parameter-efficient adaptation, and on-policy RL. Together, these components reshape the stability-plasticity trade-off, making continual adaptation both stable and scalable. Our results position Sequential Fine-Tuning as a powerful method for continual RL with VLAs and provide new insights into lifelong learning in the large model era. Code is available at github.com/UT-Austin-RobIn/continual-vla-rl.
comment: Accepted at RLC 2026
NestRL: A Nested Training Regime for Mutual Adaptation in Human-AI Teaming
Mutual adaptation is a central challenge in human-AI teaming, as humans naturally adjust their strategies in response to an AI agent's behavior. Existing approaches attempt to approximate human behavior by diversifying training partners; however, these partners are typically static and fail to capture the adaptive nature of human teammates. When agents are trained jointly in standard multi-agent settings, they often converge to opaque coordination strategies that work only with their co-trained partners, leading to poor generalization. To model adaptive human behavior, we formulate human-AI teaming as an Interactive Partially Observable Markov Decision Process (I-POMDP). We propose NestRL, a nested training regime that learns the solution to a finite-level I-POMDP by training agents at each level against adaptive agents from the level below. This exposes agents to adaptive behavior while preventing emergence of opaque coordination strategies. We provide theoretical analysis showing that NestRL agents avoid convergence to partner-specific strategies, and validate this empirically in the Overcooked domain against state-of-the-art baselines. NestRL achieves higher task performance with both unseen adaptive agents and real human teammates, while exhibiting significantly greater adaptability over the course of interaction.
Multiagent Systems
Agentic-J: An AI Agent for Biological Microscopy Image Analysis
Biological image analysis increasingly demands integration across heterogeneous tools, programming environments, and domain knowledge that few researchers can command simultaneously. We present Agentic-J, a containerised, multi-agent AI assistant, primarily for ImageJ/Fiji that enables biologists to specify analysis tasks in natural language, from nuclei segmentation and cell tracking to multi-condition quantification. The agent generates executable scripts organised into a documented project structure, so every analysis decision is traceable and the workflow can be reproduced or shared. The specialised sub-agents handle plugin management, code generation, debugging, quality assurance, and statistical reporting. In this paper we introduce the system's design, demonstrate real biological microscopy image analysis workflows, and detailed the technical implementation.
comment: Presented at Cell Biology at Scale 2026 (Poster). The Agentic-J project is available at https://mmv-lab.github.io/Agentic-J/
World-Task Factorization for Robot Learning
Robot learning must produce policies that generalize to new combinations of constraints, teammates, and environments. To achieve this, we must structurally factor the policy, which is a choice that dictates what generalizes, what requires retraining, and what remains entangled. Existing methods span a wide spectrum, from expecting structure to emerge from data scaling, to hand-designing it via hierarchies, skill libraries or learned specializations. In this paper, we study what we argue is the most fundamental factorization in robotics: separating the world from the task. We investigate the conditions under which this factorization is principled. World factors are properties of the embodied system and the environment; they exist independently of intent. Task factors are defined by the task's logic over what the world admits. We formalize this asymmetry through Bayesian model evidence: it aligns with the data-generating process, maintains high likelihood through an analytical world model, and reduces the Occam razor's penalty on task parameters. We instantiate this factorization by pairing AICON, a differentiable graph of recursive estimators and interconnections that is compositional, operates without task-specific data, and propagates cost gradients to actuators, with a compact, learned policy that modulates gradient paths. Gradients serve as the interface between the two factors: they carry world structure through the graph and task structure through costs, enabling low-dimensional learning while preserving structural generalization. We test the world/task factorization across three problems that encompass heterogeneous robots, environments, task logic and sensorimotor modalities. Our framework outperforms end-to-end baselines and analytical heuristics in all settings, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.
A Simple Hierarchical Causality Primer
We provide a brief primer for the idea behind formalising hierarchical causality in the context of complex systems. Here actors are not simply agents. Actors instantiate causation classes. Agents implement local dynamics in given levels or organisation in a given system. Hierarchical causality then describes how actor-level roles constrain, select, and organise agent-level behaviour across levels. The system then necessarily requires three additional structures. First, causation classes to abstract a given form of causal influence that an actor instantiates. Second, aggregation operators to move across the levels. Third, discrete event-time maps are required because the system comprises events, and the relation between local event counts and any global clock must be specified. Our formulation here is purposefully simple and discrete.
comment: 8 pages, 1 figure; short technical primer with a toy example in an appendix
Market-Based Replanning for Safety-Critical UAV Swarms in Search and Rescue Missions
Reliable autonomous UAV swarms in Search and Rescue (SAR) missions require fault-tolerant coordination capable of sustaining operations despite agent degradation. This paper introduces the Intelligent Replanning Drone Swarm (IRDS), a distributed coordination architecture designed for resource-constrained environments. The proposed framework employs a Reverse-Auction market mechanism where agents bid to service search sectors based on a distance-weighted cost function, coupled with a geometric consensus protocol for target verification. We evaluate the approach through physics-based simulations (N=8 agents, 8x8 grid) subjected to stochastic fault injection. Results indicate that the swarm autonomously reallocates tasks from failed agents with low latency relative to the total mission duration, maintaining a mission success rate of 93% under 25% workforce degradation. The proposed framework demonstrates a robust, empirically tested method for self-healing aerial robotic coordination.
comment: 6 pages, 4 figures, accepted at MIPRO 2026
QoEReasoner: An Agentic Reasoning Framework for Automated and Explainable QoE Diagnosis in RANs
Diagnosing Quality-of-Experience (QoE) degradations in operational Radio Access Networks (RANs) is a critical but notoriously complex task, traditionally requiring labor-intensive expert analysis over high-dimensional, cross-layer telemetry. While Large Language Models (LLMs) offer unprecedented reasoning capabilities, they are fundamentally unsuited for raw RANs troubleshooting: they fail at numeric time-series analysis, hallucinate protocol-violating causal links, and lack the stateful rigor required for multi-step fault localization. To bridge this gap, we present QoEReasoner, an end-to-end, LLM-driven agentic system designed for automated and explainable QoE diagnosis. QoEReasoner tames the inherent unpredictability of LLMs by grounding their reasoning in the physical realities of the network. It employs deterministic tools to reliably translate raw numeric KPIs into structured evidence, enforces protocol-consistent fault propagation through a domain-specific Knowledge Base, and leverages a Historical Bank of expert-validated cases to guide hypothesis generation. A stateful central planner orchestrates this closed-loop process across anomaly detection, causal tracing, and root-cause localization. Evaluations on real-world operational RANs datasets demonstrate that QoEReasoner outperforms strong baselines by 18\%-40\% in accuracy across multiple diagnostic tasks. Furthermore, it reduces diagnostic time from approximately 30 minutes of manual expert analysis to just 3 minutes per session, delivering highly interpretable, expert-grade reports while remaining robust across diverse LLM backbones.
RadioMaster: Multi-Agent System for Autonomous Radio Signal Generation
Translating user intents into physical radio signals represents the critical yet notoriously tedious final step in wireless prototyping, as it requires intricate knowledge of physical layer details and presents immense implementation challenges. Large Language Models (LLMs) and multi-agent systems have revolutionized conventional software engineering, raising the compelling question of whether they can resolve these formidable difficulties. However, our investigations reveal that current models experience significant limitations and fail to accomplish this task when applied to radio signal generation. This performance degradation primarily stems from severe domain ignorance and a fundamental insensitivity to physical hardware constraints. To bridge this gap, we introduce RadioMaster, a fully autonomous multi-agent framework designed to seamlessly translate user input into real-world wireless emissions. RadioMaster operates on three synergistic pillars: RadioWiki for domain-specific knowledge retrieval, RadioAgent for collaborative I/Q sample generation alongside hardware configuration, and RadioEmulator for closed-loop physical layer verification. Furthermore, we construct RadioBench, the first comprehensive benchmark tailored specifically for the radio signal generation domain. Extensive real-world evaluations demonstrate that RadioMaster significantly outperforms state-of-the-art (SOTA) baselines regarding configuration viability and signal fidelity.
From Global Policies to Local Strategies: Multi-Objective Optimization of Resource-Specific Handover Policies
Efficient resource allocation is a key challenge in business process management, with direct implications for cost, throughput time, and utilization. While recent Reinforcement Learning (RL) approaches have shown promise in deriving adaptive allocation policies, they typically neglect inter-resource collaboration patterns that can strongly influence real-world task handovers. Recognizing this, this paper introduces the first approach for multi-objective optimization of resource-level decision-making, enabling the recommendation of person-specific handover policies. To achieve this, our work combines an existing Multi-Agent System-based process simulator with a multi-objective evolutionary algorithm. The resulting approach produces Pareto-optimal, resource-specific policies that optimize the process across multiple objectives. Experimental results on synthetic and real-world datasets show that our approach reduces costs by an average of 37% and waiting time by 58%, consistently outperforming heuristic baselines and demonstrating the potential of leveraging collaboration-aware optimization to improve process performance.
Dynamic Trust-Aware Sparse Communication Topology for LLM-Based Multi-Agent Consensus
Large language model-driven multi-agent systems enhance the reliability of complex reasoning tasks through multi-round deliberation, role specialization, and cross-validation. However, existing multi-agent debate and collaboration frameworks typically adopt fully connected communication, causing the number of messages, token costs, and end-to-end latency to grow approximately quadratically with the number of agents; although fixed sparse topologies reduce overhead, they cannot adapt communication relationships to different task instances or intermediate reasoning states, making them prone either to preserving low-value interactions or to losing critical error-correction information. To address this problem, this paper proposes DySCo (Dynamic Sparse Consensus), a dynamic trust-aware sparse consensus mechanism. In each round of reasoning, DySCo estimates the value of communication edges based on agent reliability, answer divergence, and task relevance, and selects a small number of high-value edges for message exchange under budget constraints; it then aggregates the answers of different agents through dynamic trust weights and terminates the discussion early once consensus stabilizes. This mechanism replaces universal broadcasting with on-demand communication, thereby reducing communication overhead while preserving essential cross-validation information. We further present analyses of communication complexity and consensus stability, and evaluate the performance of DySCo on mathematical reasoning, logical reasoning, and factual question-answering tasks.
comment: 11 pages, 3 figures, 5 tables
MetaForge: A Self-Evolving Multimodal Agent that Retrieves, Adapts, and Forges Tools On Demand
Multimodal agents have achieved notable progress on complex reasoning tasks through tool use, yet remain limited by two issues: statically predefined tool inventories fail to generalize to unseen scenarios, and indiscriminate tool invocation incurs redundant cost and noise-induced errors. We propose MetaForge, a multimodal agent framework that learns when to invoke tools and how to evolve its toolset on demand. MetaForge factorizes agentic behavior into four coupled stages: Decide (judging whether tool use is warranted), Retrieve (selecting suitable tools), Adapt (grounding tool parameters in task context), and Forge (synthesizing new skills online and recycling them into the tool library for reuse), forming a closed judge-retrieve-adapt-forge-recycle loop. A unified orchestration policy enables the agent to choose among answering directly, reusing existing tools, or forging new ones. We jointly optimize invocation necessity, retrieval accuracy, execution effectiveness, and forged-skill reusability via reinforcement learning, with an explicit invocation-cost penalty discouraging redundant calls. Across 12 benchmarks, MetaForge consistently surpasses 16 baselines in accuracy, efficiency, and generalization, validating a paradigm shift from static tool inventories to on-demand self-evolution.
A Sheaf Framework for Strategic Multi-Agent Systems: From Consensus to Nash Equilibria
The coordination of heterogeneous autonomous agents in dynamic, adversarial environments requires simultaneous satisfaction of geometric constraints, logical consistency, temporal reasoning, and strategic optimization. Existing sheaf- and topos-theoretic frameworks provide powerful tools for geometric consensus, knowledge alignment, and causal planning, but lack explicit models for value, reward, and strategic choice. This report presents a unified categorical framework that integrates event calculus, SCEL-like ensemble formation, and game-theoretic reward structures into a single Grothendieck topos of time-space histories. We introduce the notion of a \emph{game sheaf} whose stalks contain utility functions and policy distributions, and restriction maps encode both parallel transport and best-response dynamics. We prove that Nash equilibria correspond to global sections of a derived best-response correspondence sheaf, while cohomological obstructions classify failures of strategic consistency. A detailed case study of an immunological ``bastion defense'' scenario -- heterogeneous agents forming attack/defense ensembles under resource constraints -- demonstrates the framework's expressiveness. This synthesis provides a rigorous foundation for verifiable, autonomic, and economically rational multi-agent systems.
TechGraphRAG: An Agentic Graph-Augmented RAG Framework for Technical Literature Reasoning
This paper presents an agentic retrieval-augmented generation (RAG) framework for domain-specific technical reasoning support, instantiated over a curated corpus of approximately 2,100 academic papers in intelligent tires, vehicle dynamics, and vehicle control. Unlike conventional single-pass RAG systems, the proposed architecture employs a 13-step autonomous pipeline that classifies queries by intent, scores evidence sufficiency against a multi-dimensional rubric, performs agentic retry with drift-guarded query reformulation, searches external academic databases (Crossref, OpenAlex, Semantic Scholar) through iterative optimize--search--vet loops, traverses a Neo4j knowledge graph for relational context, verifies citation integrity, and applies post-generation quality checks with automatic regeneration. Key contributions include a 100-point evidence sufficiency scoring framework across five dimensions with relevance damping and hybrid rule-based/LLM review; a route-dependent external search architecture with iterative agentic loops; a knowledge graph constructed via LLM-based entity extraction and OpenAlex author validation with intra-corpus citation resolution; and a self-correcting generation loop with citation verification and quality assessment. The framework is presented as a practical, implemented case study illustrating how agentic, evidence-grounded RAG can support literature navigation and technical reasoning over large, domain-specific corpora.
Physics-Informed Modeling and Control of Emergent Behaviors in Robot Swarms
Robot swarms can exhibit coherent collective behaviors through local perception, limited communication and decentralized decision-making, yet modeling and controlling such emergence remains challenging when behaviors unfold over multiple phases. Here we introduce PhySwarm, a physics-informed micro--macro framework that represents multi-stage swarm emergence as physically constrained density-field evolution coupled to executable robot motion. At the macroscopic level, a multi-phase advection--diffusion--reaction model (Macro-ADR) describes phase-dependent swarm-density evolution through directed transport, diffusion-based spatial regulation and behavioral phase transitions. At the microscopic level, an equivalent deterministic motion model (Micro-EDM) realizes these mechanisms through potential-field advection, density-gradient compensation and rate- or event-gated phase switching. A neural-physics controller (NPC) maps local observations and temporal memory to bounded physical parameters, and is trained with a reinforcement learning--PINN objective that combines task rewards with macro-scale density residuals and micro-scale motion-consistency constraints. In several proof-of-concept swarm missions -- including trail-guided foraging, formation-reconfigurable navigation and role-adaptive search and rescue -- we demonstrate that PhySwarm can generate distinct multi-stage emergent behaviors within a unified physics-informed modeling framework. The learned density fields and physical parameters provide interpretable evidence of how advection, diffusion and reaction jointly regulate multi-stage swarm organization. These results establish a physics-informed route for learning, interpreting and controlling emergent behaviors in robot swarms.
Agent System Operations: Categorization, Challenges, and Future Directions
As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industrial application of agent systems, these systems, like their traditional counterparts, frequently encounter anomalies. These anomalies lead to instability and insecurity, hindering their further development. Therefore, a comprehensive and systematic approach to the operation and maintenance of agent systems is urgently needed. Unfortunately, current research on the operations of agent systems is sparse. To address this gap, we have undertaken a survey on agent system operations with the aim of establishing a clear framework for the field, defining the challenges, and facilitating further development. Specifically, this paper begins by systematically defining anomalies within agent systems, categorizing them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps). We provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause localization, and resolution.
Multi-Agent Computer Use
Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and consistent re-planning based on new information. In this paper, we argue that we should instead move towards evaluating and building multi-agent computer use (MACU) systems. These systems, which emphasize planning and parallel execution, alleviate many of the shortcomings of single-agent CUAs. We propose a general multi-agent setup in which a manager model decomposes computer use tasks as a directed acyclic graph (DAG), encoding relevant dependencies and goals for subagents. At each iteration, the manager dispatches parallel CUA subagents to carry out nodes on the ready frontier of the DAG, and continuously revises the DAG (adding, canceling, or rewriting nodes) as new findings arrive from subagents. This design treats the partially observable environment of computer use as a first class challenge: information that downstream agents may not be able to re-observe are retained and passed forward through the manager and DAG structure. We demonstrate that MACU consistently improves over strong single-agent baselines by $3.4-25.5\%$ on desktop (OSWorld) and web navigation (Online-Mind2Web, WebTailBench, Odysseys) benchmarks, exhibits more favorable test-time scaling, and solves complex long-horizon tasks where single-agent CUAs get stuck. On Odysseys, a long-horizon web navigation benchmark, MACU improves average task completion wall-clock time by ${\sim} 1.5 \times$, demonstrating its efficacy in speeding up traditionally slow CUA pipelines. Our findings highlight that multi-agent coordination is a promising axis for scaling computer use agents to work productively for longer and more effectively. We release all code and interactive visualizations at https://jykoh.com/multi-agent-computer-use.
Terminal Time and Angle-Constrained Nonlinear Intercept Guidance
This paper considers the problem of simultaneously controlling an interceptor's impact time and impact angle using its lateral acceleration as the sole control input. With a single control input, the nonlinear engagement kinematics is inherently underactuated, which complicates guidance law synthesis. To overcome this challenge, a hierarchical sliding mode-based guidance law is developed to concurrently regulate the two terminal constraints. The proposed architecture consists of a two-layer sliding manifold. The first layer comprises two sub-sliding surfaces corresponding to the impact time and impact angle error dynamics, respectively, while the second layer introduces a composite sliding manifold that combines the two individual sub-surfaces. Then, a variable-gain adaptive guidance law is designed to ensure time and angle-constrained interception against a stationary target, which is further extended to intercept a constant velocity target. Simulations are conducted for various engagement scenarios to attest to the efficacy of the proposed approach.
The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models
Human behaviour during epidemics affects infectious disease dynamics, but quantifying this remains deeply challenging. Here we introduce the Epi-LLM framework: a novel integration of agent-based modelling, real-life epigames, and large language models (LLMs) in which a synthetic society of agents reasons and adapts dynamically over an outbreak contact network. Comparing synthetic agent behaviour against a no-intervention SEIR baseline and human participant data from the AUIB epigame study, we find that LLM agents across four different architectures reduced peak active infections, with quarantine compliance peaking at 58-65% on day six of the 15-day simulation. A binomial generalised linear model showed that perceived health severity was the strongest predictor of quarantine behaviour ($β= 0.33, p = 0.002$), yielding a pseudo-$R^2$ of 0.055, comparable to the 0.072 observed in the human trial. LLM architecture is a key determinant of epidemic dynamics: low-variance architectures offer greater internal validity for testing behavioural rules, while high-variance models may better represent real-world decision-making. Geographic labels alone do not induce culturally differentiated behaviour; explicit attitudinal parameterisation is required. This proof-of-principle work lays the groundwork for deploying the Epi-LLM framework as a scalable, risk-free simulation environment for pandemic preparedness research.
comment: Submitted to American Journal of Epidemiology
When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning
When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.
comment: 27 pages, 4 figures, 12 tables. Includes appendix with full experimental results, prompt templates, and dataset statistics
Toward a Modular Architecture for Embedded AI Agent Systems at the Edge
The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy in pervasive computing environments remains challenging due to the strict memory and energy constraints of embedded microcontrollers. Existing frameworks typically assume server-class resources or continuous connectivity, leaving a gap for deeply embedded systems. This paper proposes a modular reference architecture for Embedded Agent Systems that bridges the divide between deterministic real-time control and agentic intelligence. We introduce a tiered design that decouples On-Device Agents - executing highly compressed neural networks and rule-based logic for low-latency, privacy-critical tasks - from Cloud-Augmented Agents that leverage Small Language Models (SLMs) for higher-level reasoning and planning. A key contribution is the integration of a cross-cutting Governance Layer, ensuring observability, policy enforcement, and safety across distributed fleets of autonomous devices. Rather than presenting purely empirical benchmarks, we analyze architectural design principles and trade-offs regarding latency, energy, and reliable execution in resource-constrained environments.
Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions
How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi-step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed-system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long-term global performance. Our results suggest a new path to multi-agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.
Self-Regulation through Communication in Evolved Neural Agents
Communication is typically understood as indication: signals that transfer information from sender to receiver. We present a minimal predator avoidance task in which pairs of evolved CTRNN agents use communication for robust survival, and in which agents hear their own vocalizations, as in natural systems. Across 112 perfect-fitness agents from over 2,000 evolutionary runs, three dominant strategies emerge (accounting for 81% of agents): safety calling (39%), where agents signal from safe cover; alarm indication (22%), where agents vocalize when a threat is present without relying on self-hearing; and self-regulatory calling (20%), where agents depend on hearing their own call to sustain escape behavior. Self-hearing dependency is common among agents that call during an active threat (47%), but rare among agents that call only after reaching safe cover (10%; p < 10^-4). This pattern is consistent with a difference in causal order: safety callers act then communicate, while self-regulatory callers communicate in order to act. Removing self-hearing selectively impairs self-regulatory callers (fitness 0.40) while safety callers remain functional (0.90; p < 10^-9). These results show that communication can evolve to serve the caller's own behavioral regulation, not just information transfer to others.
comment: 7 pages, 5 figures. Submitted to ALIFE 2026
Democracy on Rugged Landscapes: Phase Transitions in Optimal Voting Rules
Laws and institutions shape individual outcomes through complex interactions with citizens' diverse circumstances, yet how different voting methods navigate this coupled landscape remains poorly understood. We model collective governance as optimization on NK fitness landscapes, where shared bits (laws) are updated by voting while individual bits (personal traits) remain fixed. A cross-dependency parameter $α$ controls how legislation's effects depend on individual circumstances. We compare eight standard voting methods and a generalized scoring family across landscape ruggedness $K \in \{1,\ldots,20\}$ and $α\in [0,1]$ with 1000 runs per configuration. Under direct democracy, the optimal voting method undergoes sharp phase transitions as a function of landscape complexity: cardinal score voting dominates on smooth landscapes, ordinal scoring with $p=0.35$ at low-to-moderate ruggedness, Borda count across a wide middle range, and STAR voting at the highest complexity. A two-parameter empirical formula reduces the $(K, α)$ plane to a single complexity axis for visualization. Borda count achieves the highest mean fitness and lowest variance across most of the parameter space. We further introduce a representative democracy model parameterized by identity weight $β$ and candidate self-interest $p_{\mathrm{self}}$. Representation reshapes the complexity-dependent structure even under favorable conditions: cardinal score voting dominates across most regimes, with plurality emerging as the top method at high $β$ and low-to-moderate $p_{\mathrm{self}}$.
comment: 8 pages, 3 figures. Submitted to ALIFE 2026
ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents
Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally and commits to sequential, irreversible decisions under uncertainty. Static benchmarks cannot probe and existing interactive medical benchmarks each compromise on at least one of them. We present ClinEnv, an interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions under a paradigm we term Longitudinal Inpatient Simulation. Each case is automatically constructed into an ordered sequence of decision stages; at every stage the model must actively query four specialized agents before committing to medications, procedures, and diagnoses. ClinEnv scores both what the model decides, through deterministic ontology-grounded matching, and how it gathers information. Across seven models, the strongest reaches only 0.31 decision F1, and outcome quality is sharply decoupled from process quality. Difficulty concentrates in management decisions and later stages, where models recover discharge diagnoses far more reliably than management actions (0.51 vs. 0.17 F1) and continue to issue redundant queries as cases progress. ClinEnv makes this information-acquisition gap, invisible to outcome-only evaluation, directly measurable.
comment: 20 pages, 6 figures, 12 tables
A No-Regret Framework for Adaptive Incentive Design
Incentive design studies how a central authority can influence strategic agents through payments, subsidies, or taxes, so that individual objectives align with collective welfare. This paper introduces a No-Regret Adaptive Incentive Design (RAID) framework for nonlinear games with continuous action spaces and private agent costs. In this framework, the authority (planner) designs incentives that regulate the Nash equilibrium toward a socially optimal action profile, while simultaneously learning agents' unknown preferences from repeated strategic responses. We formulate the RAID problem and construct a least-squares estimator whose strong consistency requires only diminishing excitation. Leveraging this weak excitation requirement, we propose a switching incentive policy that alternates between probing (exploration) and estimate-based (exploitation) incentives. The resulting policy achieves an $O(t^{-0.5})$ parameter estimation rate and accumulates $O(t^{0.5}\log t)$ squared social-cost regret, almost surely. We further extend the framework to an endogenous-noise response model, where standard least-squares estimation is biased due to an error-in-variables correlation between the noise and agent responses. We utilize a repeated-sampling estimator and corresponding switching policy that retain the same almost-sure convergence and regret rates. Numerical experiments validate the effectiveness and predicted convergence rates of the method.
comment: 21 pages, 5 figures
ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning ACL 2026
The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answering for Future Data Forecasting and Reasoning, and propose the first dataset to cover time-series forecasting and forecast-based reasoning scenarios using real estate data. This task poses challenges in retrieving precise historical data, overcoming the forecasting limitations of LLMs, and standardizing responses for diverse queries. To solve the above challenges, we propose TimeFore, an LLM agent-based framework that decomposes the problem into three collaborative roles: a Retriever autonomously generates SQL to fetch data, a Forecaster invokes external time-series models for higher accuracy, and an Analyzer synthesizes the results to construct a precise and consistent final answer. Extensive experiments demonstrate the effectiveness of our TimeFore.
comment: This paper has been accepted by Findings of ACL 2026
A Game-Theoretic Decision Framework for Optimal Selection of Coordination Detection Methods in Multi-UAV Fleet Operations
Detecting coordination among unmanned aerial vehicle (UAV) fleets operating in shared airspace and identifying the route-lead aircraft whose navigation decisions govern fleet behavior presents a fundamental speed--accuracy trade-off: fast methods enable real-time traffic management but sacrifice detection fidelity, while accurate methods may exceed the time budget for actionable airspace deconfliction. This paper presents a game-theoretic decision framework that resolves this trade-off by formulating method selection as a two-player zero-sum game between a Monitor (selecting computational methods and parameters) and Nature (selecting the unknown traffic scenario). We construct an end-to-end pipeline from trajectory surveillance data through eight candidate detection algorithms, a Monte Carlo sensitivity analysis characterizing their stochastic performance, and finally a multi-objective optimization layer that identifies Pareto-optimal method portfolios. The minimax solution provides a robust mixed strategy with a probability distribution over methods that guarantees worst-case performance regardless of scenario uncertainty. Experimental evaluation across 200 randomized configurations spanning 5--50 aircraft demonstrates that the framework recommends distinct method portfolios depending on operational priority: Koopman Phase dominates balanced (70.6%) and speed-priority (79.7%) profiles, while CRQA emerges as primary (47.4%) when route-lead identification is prioritized. The framework achieves a guaranteed game value of 0.29--0.53 (normalized utility) across all tested preference profiles, providing the first principled, scenario-adaptive methodology for computational method selection in UTM fleet monitoring operations.
LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning ICML 2026
Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-Agent Communication (LMAC), which leverages an LLM's reasoning capability to design a communication protocol that enables all agents to reconstruct the underlying state as accurately and uniformly as possible. LMAC iteratively refines the protocol using an explicit state-awareness criterion, improving state recovery while narrowing differences in agents' knowledge. Experiments on diverse MARL benchmarks show that LMAC improves state reconstruction across agents and yields substantial performance gains over prior communication baselines.
comment: 9 pages for main, 32 pages for total, Accepted to ICML 2026
The Social Cost of Intelligence: Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems
Bias in large language models (LLMs) remains a persistent challenge, often leading to stereotyping and unfair treatment across social groups. While prior work has mainly focused on individual LLMs, the emergence of multi-agent systems (MAS), where multiple LLMs collaborate and communicate, introduces new and underexplored dynamics in how bias emerges, propagates, and amplifies. To systematically investigate these dynamics, we propose a simple evaluation framework with three agent-level metrics that quantify bias emergence, propagation, and amplification throughout multi-agent interaction. We evaluate MAS across three bias benchmarks under varying LLM backbones, social-group configurations, communication behaviors, and adversarial settings. Our results show that communication can trigger up to 70\% new bias emergence, propagate bias across over 80\% of agents, and amplify stereotypes by more than 3$\times$. We further find that denser and competitive communication generally increases bias. Finally, we demonstrate that MAS are highly vulnerable to simple bias injection attacks, and existing defense strategies provide only limited protection. Our findings provide important insights into the fairness and robustness of multi-agent LLM systems.
NestRL: A Nested Training Regime for Mutual Adaptation in Human-AI Teaming
Mutual adaptation is a central challenge in human-AI teaming, as humans naturally adjust their strategies in response to an AI agent's behavior. Existing approaches attempt to approximate human behavior by diversifying training partners; however, these partners are typically static and fail to capture the adaptive nature of human teammates. When agents are trained jointly in standard multi-agent settings, they often converge to opaque coordination strategies that work only with their co-trained partners, leading to poor generalization. To model adaptive human behavior, we formulate human-AI teaming as an Interactive Partially Observable Markov Decision Process (I-POMDP). We propose NestRL, a nested training regime that learns the solution to a finite-level I-POMDP by training agents at each level against adaptive agents from the level below. This exposes agents to adaptive behavior while preventing emergence of opaque coordination strategies. We provide theoretical analysis showing that NestRL agents avoid convergence to partner-specific strategies, and validate this empirically in the Overcooked domain against state-of-the-art baselines. NestRL achieves higher task performance with both unseen adaptive agents and real human teammates, while exhibiting significantly greater adaptability over the course of interaction.
Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning
We study learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks assigned to agents enables breaking down a team-level objective into simpler, smaller sub-tasks. However, existing approaches remain sample-inefficient and are limited to the single-task case, requiring retraining policies for each new task. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify challenges to the feasibility of ACC-MARL, propose solutions, and prove that our approach is optimal. We further show that learned value functions can be used to assign tasks optimally at test time. Experiments demonstrate emergent task-aware, multi-step coordination among agents, such as pressing a button to unlock a door, holding the door, and short-circuiting tasks.
Delayed Repression and Emergent Instability in Adaptive Multi-Agent Systems
Regulatory institutions (from content moderation platforms to financial supervisors) observe, deliberate, and intervene only after a characteristic delay. We ask whether this processing lag alone can destabilize a multi-agent system that would otherwise remain stable, without exogenous shocks, coordination among agents, or malicious actors. We study this in two stages. First, we analyze a delayed replicator equation in which autonomous agents benefit from radical behavior but face punishment based on a lagged institutional alarm signal. We derive a closed-form critical delay beyond which the unique interior equilibrium loses stability through a Hopf bifurcation, and prove via center manifold reduction that the bifurcation is supercritical (bounded oscillations, not explosive growth) for the entire sigmoid response family. Second, we embed N=240 agents on a network with reinforcement learning (tabular Q-learning) and cross institutional delay with three decision architectures: fixed-policy, reactive (a memoryless threshold heuristic), and Q-learning. The hierarchy is opposite to the naive expectation that learning amplifies instability. Reactive agents are perfectly stable without delay yet collapse once delay is introduced (96% runaway by delay >= 8); fixed-policy agents are immune (0% at all delays); Q-learning agents are only partially resilient (66% at delay 20). The destabilizing ingredient is reactivity to delayed signals, not learning: agents that immediately exploit low-alarm windows trigger oscillatory feedback loops, while learning buffers this through punishment memory encoded in value functions. Throughout, "runaway" denotes bounded large-amplitude oscillation crossing a radical-fraction threshold, consistent with the supercritical bifurcation, not unbounded growth.
comment: 32 pages, 13 figures, 2 appendices. v2: corrected network parameterization; central result re-anchored on reactive agents; added robustness sweeps; bibliography fixes; structural and language edits. Code: https://github.com/YehudaItkin/delayed-repression-instability
Semantic knowledge guides innovation and drives cultural evolution
Cultural evolution allows ideas and technologies to accumulate across generations, reaching their most complex and open-ended form in humans. While social learning enables the transmission of such innovations, the cognitive processes that generate them remain poorly understood. Classical theories typically treat innovation as random variation, a simplification insufficient for explaining the complexity of human cultural evolution. We propose that semantic knowledge-the associations linking concepts to their properties and functions-guides human innovation and drives cumulative culture. To test this, we combined an agent-based model, which examines how semantic knowledge shapes cultural evolutionary dynamics, with a large-scale behavioral experiment (N = 1,243) testing its role in human innovation. Across both approaches, we found that semantic knowledge directed exploration toward meaningful solutions, enhanced innovation success, and enabled generalization from prior discoveries. Moreover, semantic knowledge interacted synergistically with social learning to amplify innovation and accelerate cumulative cultural change. In contrast, experimental participants lacking access to semantic knowledge performed no better than chance, even when social learning was possible, and relied on shallow exploration strategies for innovation. Together, these findings suggest that semantic knowledge is a key cognitive process underpinning human cumulative culture.
On Signed Network Games with Binary Actions
We study binary-action pairwise-separable graphical games that encompass both coordination and anti-coordination network games. Our model is grounded in an underlying directed signed graph, where each link is associated with a signed weight that describes both nature and the strength of the strategic pairwise interaction. Specifically, positive link weight corresponds to a strategic complement type interaction, whereas negative link weight corresponds to strategic substitute type interaction. The utility for each player is then an aggregation of pairwise terms determined by the weights of the signed graph in addition to an individual bias term. We consider a scenario that assumes the presence of a prominent cohesive subset of players, who are either connected exclusively by positive weights, or form a structurally balanced subset that can be bipartitioned into two adversarial subcommunities with positive intra-community and negative inter-community edges. Under suitable properties of the game restricted to the remaining players, our results guarantee the existence of Nash equilibria characterized by either consensus or polarization within the first group, as well as their stability under best response transitions. Our results can be interpreted as robustness results, building on the super-modular properties of network coordination games and on a novel use of the concept of graph cohesiveness.
comment: 15 pages, 8 figures, 1 table
MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment
Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings poses challenges because interactions with an environment are prohibited. In this paper, we propose a new framework, namely Multi-Agent Causal Credit Assignment (MACCA), to address credit assignment in the offline MARL setting. Our approach, MACCA, characterizing the generative process as a Dynamic Bayesian Network, captures relationships between environmental variables, states, actions, and rewards. Estimating this model on offline data, MACCA can learn each agent's contribution by analyzing the causal relationship of their individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to integrate with various offline MARL methods seamlessly. Theoretically, we proved that under the setting of the offline dataset, the underlying causal structure and the function for generating the individual rewards of agents are identifiable, which laid the foundation for the correctness of our modeling. In our experiments, we demonstrate that MACCA not only outperforms state-of-the-art methods but also enhances performance when integrated with other backbones.
comment: 21 pages, 4 figures
RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation ICML 2026
Compared with individual agents, large language model based multi-agent systems have shown great capabilities consistently across diverse tasks, including code generation, mathematical reasoning, and planning, etc. Despite their impressive performance, the effectiveness and robustness of these systems heavily rely on their communication topology, which is often fixed or generated in a single step. This restricts fine-grained structural exploration and flexible composition, resulting in excessive token utilization on simple tasks while limiting capability on complicated tasks. To mitigate this challenge, we introduce RADAR, a redundancy-aware and query-adaptive generative framework that actively reduce communication overhead. Motivated by recent progress in conditional discrete graph diffusion models, we formulate communication topology design as a step-by-step generation process, guided by the effective size of the graph. Comprehensive experiments on six benchmarks demonstrate that RADAR consistently outperforms recent baselines, achieving higher accuracy, lower token consumption, and greater robustness across diverse scenarios. Our code and data are available at https://github.com/cszhangzhen/RADAR.
comment: Accepted by ICML 2026 (fix typos)
Systems and Control (EESS)
Integrated Sensing and Covert Communication In Low-Altitude Networks: A Smart Radio Environment Perspective
The rise of low-altitude economies and 6G is driving the evolution of low-altitude networks (LANs), making communication security a pressing concern. Unlike traditional security approaches, covert communication offers enhanced protection by hiding the transmission behavior itself. Integrated sensing and communication (ISAC), a key technology of 6G, efficiently supports both sensing and communication tasks through hardware integration, thereby promising significant gains for covert communication. Nevertheless, the complexity and dynamics of urban environments pose critical challenges. Drawing on the latest advances in smart radio environment (SRE) technologies, this paper introduces them into integrated sensing and covert communication (ISACC) to suppress covert channel fading and counteract sensing precision loss in LANs. We first survey the applications and state-of-the-art findings of ISACC in LANs, highlighting key practical challenges. Subsequently, we introduce the core concept of SRE and elaborate on its enabling techniques across four dimensions. To deliver more insights, we explore potential pathways for integrating SRE into ISACC. To maximize covert throughput, a reinforcement learning-based case study is conducted by jointly optimizing flight trajectory, jamming power, movable antenna position, bandwidth allocation, and beamforming vectors. Simulation results show that the proposed scheme achieves superior performance compared to the benchmark. Finally, some open challenges and potential directions are discussed.
Detecting Cyber Attacks in Power System AGC Using a Drifted Ornstein-Uhlenbeck Process
The Automatic Generation Control (AGC) system, reliant on real-time measurements over communication networks, is susceptible to stealthy false data injection attacks (FDIAs), risking equipment damage and economic losses. We propose a robust FDIA detection method using maximum likelihood estimation (MLE) of a drifted multivariate Ornstein-Uhlenbeck (OU) process. Independent of load observability, in various cyberattack scenarios, the proposed FDIA detection method delivers accurate and rapid detection of sophisticated FDIAs, outperforming traditional unknown input observer (UIO) methods, which miss detections, and Long Short-Term Memory Autoencoder (LSTM-AE) approaches, which suffer from prolonged detection times.
AI-Based KPI Prediction Methods in Future 6G Networks: A Survey
The evolution from 5G to 5G-Advanced and the vision of 6G demand unprecedented levels of network performance, in which meeting stringent network Key Performance Indicators (KPIs), including capacity, latency, coverage, and reliability, is critical to supporting emerging applications such as autonomous driving, industrial automation, and immersive communications. Traditional reactive network management is insufficient in this context, driving the need for predictive, data-driven approaches. Machine Learning (ML) has emerged as a key enabler, enabling the forecasting of KPI trends from diverse data sources and thereby enabling proactive, AI-native automation in mobile networks. This survey provides the first comprehensive and systematic review of data-driven KPI prediction methods for future 6G networks. We introduce a multi-dimensional taxonomy that classifies prediction approaches by KPI type, data source, the network protocol stack at which the KPI is predicted, prediction horizon, model family, and prediction objective. Using this taxonomy, we analyze the state of the art across various KPIs, highlighting representative methods ranging from classical statistical models to deep learning and reinforcement learning. We further discuss enabling system aspects, including data collection and learning architectures, and examine deployment challenges, including data availability, scalability, privacy, and sustainability. Finally, we outline open research directions spanning new KPI definitions, probabilistic and explainable predictions. This survey aims to provide researchers and practitioners with a structured understanding of the KPI prediction landscape and a roadmap toward predictive network automation in future 6G systems.
comment: Submitted to IEEE Communications Surveys and Tutorials, 30 pages
Market-Based Replanning for Safety-Critical UAV Swarms in Search and Rescue Missions
Reliable autonomous UAV swarms in Search and Rescue (SAR) missions require fault-tolerant coordination capable of sustaining operations despite agent degradation. This paper introduces the Intelligent Replanning Drone Swarm (IRDS), a distributed coordination architecture designed for resource-constrained environments. The proposed framework employs a Reverse-Auction market mechanism where agents bid to service search sectors based on a distance-weighted cost function, coupled with a geometric consensus protocol for target verification. We evaluate the approach through physics-based simulations (N=8 agents, 8x8 grid) subjected to stochastic fault injection. Results indicate that the swarm autonomously reallocates tasks from failed agents with low latency relative to the total mission duration, maintaining a mission success rate of 93% under 25% workforce degradation. The proposed framework demonstrates a robust, empirically tested method for self-healing aerial robotic coordination.
comment: 6 pages, 4 figures, accepted at MIPRO 2026
Anti-Windup in PID Control: Review, Analysis, and New Tuning Directions
Actuator saturation is a fundamental nonlinearity that significantly degrades the performance of PID-controlled systems by inducing integrator windup, leading to overshoot, slow recovery, and even instability. Although numerous anti-windup strategies have been proposed, their practical tuning remains largely heuristic and suboptimal in many industrial scenarios. This paper presents a comprehensive comparative study of classical and advanced anti-windup techniques for PI-controlled first-order-plus-dead-time (FOPDT) processes under a wide range of operating conditions. The analysis includes dynamic and instantaneous back-calculation, conditional integration, and adapted schemes. In addition, a novel hybrid anti-windup strategy is proposed, combining conditional integration with dynamic back-calculation to improve responsiveness during saturation, whilst preserving smooth recovery dynamics. Moreover, a key contribution of this work is the development of systematic tuning rules for the tracking time constant in back-calculation schemes, specifically optimised for load-disturbance rejection. These rules are derived from an extensive optimisation study that considers the saturation ratio, controller aggressiveness, and disturbance characteristics. The resulting guidelines provide simple yet effective formulas that achieve near-optimal performance without requiring complex computations. Simulation results demonstrate that the proposed methods significantly outperform commonly used heuristic rules, particularly in disturbance rejection scenarios, and provide clear, practical recommendations for selecting and tuning anti-windup strategies in industrial applications.
Secure RSMA-based Visible Light Networks under Spatial Correlation
This paper investigates the secrecy sum rate (SSR) of rate-splitting multiple access (RSMA)-based visible light communication (VLC) systems considering internal eavesdropping, where legitimate users may intercept private data intended for others. We formulate an optimization problem to maximize the SSR of the system, which is inherently non-convex due to the complex coupling of the objective function and constraints. To this end, two different approaches based on the convex-concave procedure (CCCP) and semidefinite relaxation (SDR) are leveraged to solve the non-convex parameterized problem. A central focus of this work is the investigation of channel similarity (CS), which serves as a metric for quantifying spatial correlation, and its impact on SSR performance. To mitigate the performance degradation caused by high spatial correlation, we propose a channel similarity reduction (CSR) clustering strategy that proactively minimizes CS to restore the system's degrees of freedom (DoF). Numerical results are provided to demonstrate the performance of the two proposed algorithms under various levels of CS. More importantly, the findings reveal that our proposed CSR-clustering strategy significantly outperforms existing baselines, effectively overcoming the secrecy performance ceiling caused by high spatial correlation.
FlipItRight: Stable Pose-Targeted Throw-Flip Across Diverse Objects
We propose FlipItRight, a framework for stable planar pose-targeted throw-flip with a high-DoF manipulator. The task is decomposed into an object-level planner, which generates candidate release states satisfying the desired landing pose, and a robot-level planner, which evaluates executability and constructs a feasible swing motion. Treating the release state as an explicit intermediate representation enables principled candidate filtering, adaptive selection of release and pre-swing configurations, and structured near-release motion design -- in particular, approximately constant end-effector velocities during the final swing phase to improve robustness to release-timing uncertainty. We validate on a real platform across objects of varying shape, size, and mass, achieving a 90% success rate across 120 trials. Ablation studies confirm that each design choice contributes to throwing performance, and the framework requires no prior data or learned model, enabling direct deployment on new objects and targets without environment-specific calibration or data collection.
Deconstructing the Composite Channel for Beyond Diagonal RIS: Channel Estimation and Beamforming Design
As beyond-diagonal reconfigurable intelligent surfaces (BD-RISs) gain increasing attention in high-frequency wireless communications, accurate and scalable channel-estimation methods become essential. This paper develops a parametric channel-estimation and beamforming framework that deconstructs the composite BD-RIS channel into its generating directional factors, revealing the tensor structure induced jointly by propagation geometry and beyond-diagonal scattering. We propose two tensor-based estimators: Fourth-Order Tucker Channel Estimation (FORTE), which models the partially structured channel as a fourth-order Tucker tensor, and Fourth-Order PARAFAC Channel Estimation (FORPE), which captures the fully structured channel through a fourth-order PARAFAC model. By exploiting partial and full channel geometry, the proposed methods achieve higher estimation accuracy than Least Squares and Block Tucker Kronecker Factorization benchmarks. In particular, FORTE outperforms FORPE due to its more compact representation, attaining an NMSE of about 10^{-4} at 5 dB SNR. In contrast, FORPE provides essentially unique estimates of the composite-channel factor matrices, whereas FORTE identifies their subspaces. The proposed deconstruction also provides a structured representation useful for sensing-oriented parameter extraction and tensor-structured system optimization. Finally, the Tensor Optimization Framework for Beamforming, Combining, and Scattering (TenFormer) achieves spectral efficiency comparable to the benchmark design while significantly reducing computational complexity through parallel tensor-structured optimization.
Making Aggregations Reliable: Realizability Guarantees for Battery Fleets with Heterogeneous Power and Energy Limits
Aggregated battery energy storage systems (BESS) enable large fleets of heterogeneous battery elements to participate in system-level optimization and electricity markets. Scheduling each element independently is computationally impractical at scale. While many aggregate battery models rely on convex relaxations, they often ignore element complementarity constraints, leading to dispatch solutions that may be infeasible when implemented on individual battery elements. This paper develops a realizable composite battery model for parameter-heterogeneous BESS fleets that guarantees feasibility at the element-level while preserving computational tractability. We derive simple linear conditions under which aggregate charging and discharging trajectories can be safely disaggregated while respecting individual power limits, energy limits, and complementarity constraints under a priority-based controller. Numerical experiments in a unit-commitment setting demonstrate that the proposed realizable composite battery formulation produces feasible dispatch solutions. Solve times are effectively independent of system size, unlike micro-model mixed-integer formulations. Solutions obtained from the proposed formulation converge to the optimal benchmark as control granularity is refined. Additional studies illustrate the robustness of the framework to moderate violations of key modeling assumptions, including heterogeneous power-to-energy ratios.
Power System CBFs
Control barrier functions (CBFs) have become a standard tool in safety critical-control systems. CBFs convert state constraints into real time control conditions that certify forward invariance (meaning that once the system starts in a safe region, it remains there for all future times) and minimally modify a nominal controller only when safety is at risk. In power systems, CBF based methods have been proposed for frequency and voltage safety, but they largely remain disconnected from three key features that are central to power system operation: differential algebraic equation (DAE) models that capture network power flow constraints, safety specifications involving algebraic variables such as bus voltages, and formal verification of the resulting closed loop system. This paper closes this gap by developing a CBF framework for power system DAE models that supports safety constraints on both dynamic and algebraic variables. The framework provides real time safety filtering through an optimization layer that wraps around an existing controller and minimally modifies its command to enforce safety. In addition, it provides formal verification (i.e., a mathematical guarantee that all admissible trajectories satisfy the prescribed safety constraints) through an offline reachability based certificate of safe operation. The result is a unified filter and verify methodology for enforcing and certifying frequency and voltage safety in power systems while preserving the DAE structure of the underlying model.
Koopman operator learning for predictive control via Khatri-Rao kernel regression
This paper develops a data-driven realization of the generalized Koopman operator (GeKo), in which states and inputs are lifted independently and the dynamics are expressed as a tensor bilinear system. The first contribution is a time-sequenced multi-step Khatri-Rao kernel regression formulation that exposes the operator to evolved snapshots along trajectories rather than only single one-step pairs, which reduces compounded prediction error. Secondly, we develop a kernel- and input-agnostic structured SVD reduction that compresses the lifted state and input spaces while preserving the Khatri-Rao realization. We instantiate the framework with random Fourier features and describe a complete predictive-control pipeline, including a multi-step roll-out diagnostic that guides the choice of MPC horizon. The framework is validated on the chaotic Lorenz system, where the learned reduced-order GeKo model stabilizes an unstable equilibrium from a range of initial conditions.
Towards Efficient Synthesis of Quantum Graph States by Fusing Graph Motifs
Photonic graph states with advanced topologies can enable measurement-based quantum computing, distributed quantum sensing, and quantum interconnects. However, the efficient generation of photonic graph states is limited by the probabilistic nature of photonic entangling operations and the exponential dependence of generation rate on resource cost. In this work, we study photonic graph state synthesis as a cost-aware decomposition problem, exploiting local Clifford (LC) equivalence to identify more synthesis-friendly representations of the target graph state before decomposition. Specifically, we propose Cost-aware Fusion-based Decomposition (CFD), a three-stage heuristic framework that decomposes a target graph state into ring, star, and linear motifs, and assembles them via Type-I fusion operations to minimize fusion overhead and physical-qubit consumption. We further show that selecting the LC-equivalent graph state with the minimum number of edges provides a highly effective proxy for near-optimal synthesis: in many cases it matches the best generation rate observed within the LC equivalence class under CFD, and in most remaining cases it remains close to it. Numerical evaluations on graph state orbit data and 2D and 3D lattice graph states demonstrate that CFD achieves up to 84.6\% reduction in resource overhead compared to baseline constructions, and yields improvements in photonic generation rate spanning multiple orders of magnitude. These results suggest that combining structure-aware motif decomposition with LC equivalence is a practical and scalable strategy for photonic graph state synthesis.
Package-Embedded Coupled Inductor Arrays for High-Performance Computing Power Delivery
A novel power delivery framework, comprising a package-embedded inductor topology and an inductance-island methodology, is introduced to maximize both inductance and current densities in vertical power delivery (VPD). The framework leverages multiple multi-phase converters, a common strategy in high-performance computing systems, to enhance efficiency and scalability. The proposed topology employs an array of tightly coupled spiral square inductors sharing a common magnetic rod, serving multiple converters operating in the same conversion phase. The array is optimized to maximize coupling and minimize conversion losses, achieving superior inductance and current densities of 250 nH/mm^2 and 10 A/mm^2, respectively. At the system level, the inductance-island methodology partitions the power delivery network into multiple islands, each dedicated to a converter phase and supplying a portion of the load current, thereby enabling scalable and efficient distribution. To validate the framework, the inductor array is designed and simulated in ANSYS Maxwell 3D and Mechanical, exhibiting an average quality factor of 23.6 and efficiency of 97.4% at 2 A load current, 6 V input, and 10 MHz switching frequency. The inductor array netlist is extracted from ANSYS and co-designed in Cadence Virtuoso with a distributed dual-phase power conversion system, ensuring joint optimization of passive and active components. The co-designed converter achieves a significant efficiency gain of 5.65% on average and up to 11.04% at 40 A load over a similar converter with uncoupled inductors, demonstrating the practical benefits of the approach.
comment: 11 page, 13 figures, 7 tables, accepted for publication in IEEE Transactions on Components, Packaging, and Manufacturing Technology (T-CPMT), Special Section on Vertical Power Delivery for Next-Generation Advanced Packaging Systems
Terminal Time and Angle-Constrained Nonlinear Intercept Guidance
This paper considers the problem of simultaneously controlling an interceptor's impact time and impact angle using its lateral acceleration as the sole control input. With a single control input, the nonlinear engagement kinematics is inherently underactuated, which complicates guidance law synthesis. To overcome this challenge, a hierarchical sliding mode-based guidance law is developed to concurrently regulate the two terminal constraints. The proposed architecture consists of a two-layer sliding manifold. The first layer comprises two sub-sliding surfaces corresponding to the impact time and impact angle error dynamics, respectively, while the second layer introduces a composite sliding manifold that combines the two individual sub-surfaces. Then, a variable-gain adaptive guidance law is designed to ensure time and angle-constrained interception against a stationary target, which is further extended to intercept a constant velocity target. Simulations are conducted for various engagement scenarios to attest to the efficacy of the proposed approach.
PI and PID Tuning of Plants up to Third Order for a Monotonic Minimum Settling Time Solution
A unified, closed-form analytical PI/PID tuning method is presented for all-pole plants up to third order that yields a strictly monotonic (zero-overshoot) step response with minimum settling time. The design target is the binomial closed loop p^n/(s+p)^n, which is monotonic with robustness depending only on the order n. Because adding a left-half-plane zero to a fixed pole pattern only slows the response, the minimum-settling solution requires the controller zeros to be cancelled, which forces the controller numerator to divide the plant denominator. Carrying this principle through shows that an exact, real-gained solution exists for any stable plant precisely up to second order with a PI controller and third order with a PID controller; the residual binomial factor acquires a complex pair beyond that, which a generic plant does not contain. Explicit gains are derived for first-order plants (PI), second-order plants with real and complex poles (PI and PID), and third-order plants with three real poles and with one real pole plus a complex pair (PID). The second-order PI case is treated in full as the lowest-order instance. Monotonicity guarantees Mt = 1, hence Ms less then 2, phase margin above 60 degree, and gain margin above 6 dB, tightening to universal constants for the binomial family. Numerical verification confirms the results.
Fairness as an Investment: Dynamic Participation and Long-Run Profit in Virtual Power Plants
We show that incorporating fairness constraints into virtual power plant (VPP) operations can incentivize consumer participation and thus improve the aggregator's long-run profitability. VPPs rely on sustained participation from heterogeneous consumers to provide a variety of grid services whose timing and frequency are often uncertain. As a result, consumers' willingness and ability to provide flexibility evolve over time, creating a dynamic link between past participation and future resource availability. We develop a dynamic aggregation framework to study how fairness in service allocation affects future participation and long-run profitability. By linking current dispatch decisions to future resource availability, we show that fairer allocations can strengthen consumer engagement, expand aggregate availability, and create additional value during high-price and high-demand events. To balance fairness and operational efficiency, we introduce a slack-augmented allocation mechanism that preserves most of the participation benefits from fairness while avoiding unnecessary reductions in service procurement. We derive conditions under which the resulting availability gains outweigh the short-run cost of redistribution and validate the approach using real-world consumer behavior and electricity market data from Norway.
Corridor Design and Separation Definition in Advanced Air Mobility: Systematic Literature Review
Advanced Air Mobility (AAM) uses electric vertical take-off and landing (eVTOL) vehicles to address urban congestion and emissions. However, corridor design, operation management, and separation standards remain underexamined for safe high-density operations. This paper applies the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to systematically review relevant literature from IEEE Xplore and Web of Science, focusing on publications from 2010 to 2024. A Context, Intervention, Mechanism, and Outcome (CIMO) framework guided the development of research questions. After screening 2,039 journal and conference papers, 62 articles met the inclusion criteria. The findings reveal a lack of integrated corridor design approaches, limited operational strategies, and reliance on standards originally designed for conventional aviation. A unified corridor design and separation definition frameworks and taxonomies are proposed to address these shortcomings, informing future investigations and operational frameworks for safe, efficient eVTOL operation deployment in urban settings.
Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics
Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferences, goals, competency, and willingness to cooperate. Safety filters are a popular approach for ensuring safety in interactive robotics, since their modular design separates safety from performance, allowing robots to operate safely around people with minimal impact on task efficiency. While traditional safety filters typically operate only in the physical space, neglecting the robot's ability to learn and adapt online, the recently proposed belief-space safety filter (BeliefSF) reasons about robot safety in closed-loop with runtime inference that actively reduces the robot's uncertainty online, thereby reducing conservativeness in filtering. However, providing formal safety guarantees for robots deploying BeliefSF remains a significant challenge due to errors in runtime inference and neural approximation of safety filters required to handle the high dimensionality of belief spaces. In this paper, we propose an algorithmic approach to certify high-probability safety of BeliefSF using conformal prediction, while explicitly accounting for the reliability of the robot's runtime inference module. Our method leverages the structure of belief-space safety filtering by focusing verification on a region where inference is expected to be reliable. It preserves the simplicity and sample complexity of standard conformal prediction, yet can certify a substantially less conservative safety filter. Through a simulated human-vehicle interaction benchmark, we show that our approach verifies a significantly more permissive belief-space safety filter than a standard conformal prediction baseline.
comment: Accepted to the 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR 2026)
A No-Regret Framework for Adaptive Incentive Design
Incentive design studies how a central authority can influence strategic agents through payments, subsidies, or taxes, so that individual objectives align with collective welfare. This paper introduces a No-Regret Adaptive Incentive Design (RAID) framework for nonlinear games with continuous action spaces and private agent costs. In this framework, the authority (planner) designs incentives that regulate the Nash equilibrium toward a socially optimal action profile, while simultaneously learning agents' unknown preferences from repeated strategic responses. We formulate the RAID problem and construct a least-squares estimator whose strong consistency requires only diminishing excitation. Leveraging this weak excitation requirement, we propose a switching incentive policy that alternates between probing (exploration) and estimate-based (exploitation) incentives. The resulting policy achieves an $O(t^{-0.5})$ parameter estimation rate and accumulates $O(t^{0.5}\log t)$ squared social-cost regret, almost surely. We further extend the framework to an endogenous-noise response model, where standard least-squares estimation is biased due to an error-in-variables correlation between the noise and agent responses. We utilize a repeated-sampling estimator and corresponding switching policy that retain the same almost-sure convergence and regret rates. Numerical experiments validate the effectiveness and predicted convergence rates of the method.
comment: 21 pages, 5 figures
A Game-Theoretic Decision Framework for Optimal Selection of Coordination Detection Methods in Multi-UAV Fleet Operations
Detecting coordination among unmanned aerial vehicle (UAV) fleets operating in shared airspace and identifying the route-lead aircraft whose navigation decisions govern fleet behavior presents a fundamental speed--accuracy trade-off: fast methods enable real-time traffic management but sacrifice detection fidelity, while accurate methods may exceed the time budget for actionable airspace deconfliction. This paper presents a game-theoretic decision framework that resolves this trade-off by formulating method selection as a two-player zero-sum game between a Monitor (selecting computational methods and parameters) and Nature (selecting the unknown traffic scenario). We construct an end-to-end pipeline from trajectory surveillance data through eight candidate detection algorithms, a Monte Carlo sensitivity analysis characterizing their stochastic performance, and finally a multi-objective optimization layer that identifies Pareto-optimal method portfolios. The minimax solution provides a robust mixed strategy with a probability distribution over methods that guarantees worst-case performance regardless of scenario uncertainty. Experimental evaluation across 200 randomized configurations spanning 5--50 aircraft demonstrates that the framework recommends distinct method portfolios depending on operational priority: Koopman Phase dominates balanced (70.6%) and speed-priority (79.7%) profiles, while CRQA emerges as primary (47.4%) when route-lead identification is prioritized. The framework achieves a guaranteed game value of 0.29--0.53 (normalized utility) across all tested preference profiles, providing the first principled, scenario-adaptive methodology for computational method selection in UTM fleet monitoring operations.
Certified Closed-Loop Control for Packet Networks: A Compositional Certification Framework
Packet networks are controlled dynamical systems with discontinuities, delayed observations, and partial state information. Adaptive or learning-driven proposers can improve performance, but an unsafe proposal may still cause starvation, tail-delay spikes, or unstable queue behaviour. This paper treats packet-network control as an executed-action certification problem. A certified operator sits between any proposer and the dataplane. At each control tick, the proposer emits an arbitrary candidate action $\tilde u(t)$. The operator either projects it to an executable action $u(t)$ that satisfies a configuration-compiled certificate, or reports INFEASIBLE and executes an always-defined fallback with quantified slack. The certificate also exports an auditable envelope $\bar z(t)$ for downstream composition. The guarantees are conditional and explicit. They apply on ticks where the operator reports CERTIFIED, the declared arrival envelope and backlog bound are valid, and the platform realises the assumed service lower bound. Under these conditions, one mechanism covers backlog caps, service floors, mitigation caps, Foster--Lyapunov drift constraints, and compositional envelope contracts. We prove operator-level safety, feed-forward compositional safety and stability using exported envelopes, and a cyclic closure result under a small-gain condition. We also define breach and infeasibility semantics, discuss calibration of the service-tracking factor that links certified targets to realised scheduler behaviour, and evaluate the design under delayed telemetry, delayed actuation, weak proposers, envelope mismatch, overload, and millisecond-scale certification. The present evaluation validates the certified execution boundary in a byte-level closed-loop backend; deployment-level scheduler tracking is left to future Linux or hardware experiments.
comment: 29 pages, 11 figures, 3 tables
Physics-Guided Recurrent State-Space Neural Networks for Multi-Step Prediction
State-space models are traditionally based on physical knowledge, but multi-step predictions from these physical models can be poor due to model inaccuracy. Black-box deep learning has shown promise as an alternative. However, these methods rely on the availability of large datasets and potentially available physical knowledge is neglected. We propose the PG-RSSNN, a physics-guided recurrent state-space neural network that incorporates recurrent structures to enable the use of non-saturating activation functions in multi-step prediction. It mitigates the vanishing gradients and eliminates the risk of numerical divergence in training seen in existing structures that feed back state estimates. Results across multiple systems with various physical model imperfections, from linear state-space models with Gaussian noise to a robotic arm and a cascaded water tank system, show that the proposed PG-RSSNN maintains stable training behavior, and improves multi-step predictions, as compared with black-box neural networks and physics-only models, even with limited training data and when physical models are only partially known.
comment: 6 pages, 3 figures. Accepted at IFAC World Congress 2026
S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty
Effective scheduling in the energy sector is essential to ensure the reliable operation of electrical grids and their connected assets by, for instance, optimizing the dispatch of generation units and storage systems. An effective planning strategy must (a) accommodate advanced and potentially non-linear system models -- exploiting the increasing data availability of modern grids, and (b) explicitly handle uncertainties arising, for instance, from the integration of renewable energy sources. While existing approaches can address either non-linearity (e.g., Monte Carlo Tree Search) or uncertainty (e.g., stochastic mathematical optimization), there is a lack of planning techniques capable of addressing both challenges simultaneously. To bridge this gap, we propose a Stochastic Scenario-Structured Tree Search (S3TS) algorithm that explicitly represents uncertainty through scenario trees while enabling the integration of advanced non-linear models. We evaluate S3TS on a simulated demand response signal publication problem, largely mimicking the imbalance settlement mechanism in Belgium. The results demonstrate near-optimal performance in linear, analytically tractable settings, with costs within 14% of the mathematically optimal solution conditioned to the scenario trees. In highly non-linear scenarios, S3TS significantly outperforms baseline methods, achieving cost reductions of up to 51% and 5.4% compared to a myopic algorithm and deterministic MCTS, respectively.
Switched Event-Triggered Adaptive Control of Reaction-Diffusion PDE-ODE with Neural Operator Implementation
This paper develops a switched event-triggered adaptive boundary control for a class of reaction-diffusion PDE-ODE cascade systems, where the system and input matrices in the ODE as well as the spatially-varying reaction coefficient in the PDE are uncertain. A two-step backstepping transformation is constructed to derive the continuous-time control law. Then a novel dynamic event-triggered control strategy for the PDE-ODE cascade is proposed based on a switched event-triggering mechanism, ensuring global exponential stability of the closed-loop system in place of the exponential convergence commonly achieved with backstepping-based classical dynamic ETC, while inherently excluding Zeno behavior. To address the uncertainties in the PDE-ODE cascade, adaptive update laws are developed, leading to time-varying gain kernels that are adaptively scheduled through the event-triggered control mechanism. Furthermore,to facilitate efficient real-time implementation, deep neural operators (DeepONet) are employed to approximate the backstepping kernels as mappings from the estimated parameters to kernel functions, thereby eliminating the need to repeatedly solve kernel PDEs online. Through a Lyapunov analysis that incorporates the effects of the event-triggering mechanism, parameter adaptation, and kernel approximation errors, we prove the $L^2$ global asymptotic regulation of the resulting closed-loop system. In summary, the key contributions of the paper are threefold: (i) developing an adaptive DeepONet-based framework for reaction-diffusion PDE-ODE cascade systems; (ii) extending the existing adaptive event-triggered control design for reaction-diffusion PDEs to the case with more complex uncertainties; and (iii) generalizing switched dynamic ETC with global exponential stability to PDE-ODE cascades. The effectiveness of the proposed approach is demonstrated through numerical simulations.
Solar Daylighting to Offset LED Lighting in Vertical Farming: A Techno-Economic Study of Light Pipes
Vertical farming is a controlled-environment agriculture (CEA) approach in which crops are grown in stacked layers under regulated climate and lighting, enabling predictable production but requiring high electricity input. This study quantifies the techno-economic impact of roof-mounted daylighting in a three-tier container vertical farm using a light-pipe (LP) system that delivers sunlight to the upper tier. The optical chain, comprising a straight duct and a tilting aluminum-coated mirror within a rotating dome, was modelled in Tonatiuh to estimate crop-level photon delivery and solar gains. These outputs were coupled with a transient AGRI-Energy model to perform year-round simulations for Dubai. Tier-3 strategies were compared against a fully LED benchmark, including daylight-only operation, on/off supplementation, PWM dimming, UV-IR filtering, variable-transmittance control, and simple glazing. Ray-tracing predicted an overall LP optical efficiency of 45%-75%, depending on solar position, quantifying the fraction of incident daylight at the collector aperture delivered to the target growing zone. Daylight-only operation reduced the total three-tier yield by 17% and was not economically viable despite 27-29% electricity savings. Hybrid daylight-LED strategies preserved benchmark yield while reducing electricity use. PWM dimming combined with UV-IR filtering achieved the lowest specific electricity energy consumption (6.32 kWh/kg), 14% below the benchmark. Overall, viability remains CAPEX-limited because achievable electricity savings are insufficient to offset the added investment and thus improves mainly under high electricity and carbon-price contexts, although the LP system delivers a 15-38% lower light cost than an optical-fiber reference under identical incident daylight.
Multi-Objective Reinforcement Learning for Tactical Decision Making for Trucks in Highway Traffic
Balancing safety, efficiency, and operational costs in highway driving poses a challenging decision-making problem for heavy-duty vehicles. A central difficulty is that conventional scalar reward formulations, obtained by aggregating these competing objectives, often obscure the structure of their trade-offs. We present a Proximal Policy Optimization based multi-objective reinforcement learning framework that learns a set of policies explicitly representing these trade-offs and evaluates it on a scalable simulation platform for tactical decision making in trucks. The proposed approach learns a set of Pareto-optimal policies that capture the trade-offs among three conflicting objectives: safety, quantified in terms of collisions and successful completion; energy efficiency and time efficiency, quantified using energy cost and driver cost, respectively. The resulting Pareto frontier is smooth and interpretable, enabling flexibility in choosing driving behavior along different conflicting objectives. This framework allows seamless transitions between different driving policies without retraining, yielding a robust and adaptive decision-making strategy for autonomous trucking applications.
React to Surprises: Stable-by-Design Neural Feedback Control and the Youla-REN
We study parameterizations of stabilizing nonlinear policies for learning-based control. We propose a structure based on a nonlinear version of the Youla-Kucera parameterization combined with robust neural networks such as the recurrent equilibrium network (REN). The resulting parameterizations are unconstrained, and hence can be searched over with first-order optimization methods, while always ensuring closed-loop stability by construction. We study the combination of (a) nonlinear dynamics, (b) partial observation, and (c) incremental closed-loop stability requirements (contraction and Lipschitzness). We find that for the combination of (c) with either (a) or (b), a contracting and Lipschitz Youla parameter always leads to contracting and Lipschitz closed loops. However, if all three hold, then incremental stability can be lost with exogenous disturbances. Instead, a weaker condition is maintained, which we call d-tube contraction and Lipschitzness. We further obtain converse results showing that the proposed parameterization covers all contracting and Lipschitz closed loops for certain classes of nonlinear systems. Numerical experiments illustrate the utility of our parameterization when learning controllers with built-in stability certificates for: (i) ``economic'' rewards without stabilizing effects; (ii) short training horizons; and (iii) uncertain systems.
Excitation of control-affine systems and Koopman error bounds
The Koopman operator and extended dynamic mode decomposition (EDMD) as a data-driven technique for its approximation have attracted considerable attention as a key tool for modeling, analysis, and control of complex dynamical systems. However, extensions towards control-affine systems resulting in bilinear surrogate models are prone to demanding data requirements rendering their applicability intricate. In this paper, we propose a framework for data-fitting of control-affine mappings to increase the robustness margin in the associated system identification problem and, thus, to provide reliable bilinear EDMD schemes. In particular, guidelines for input selection based on subspace angles are deduced such that a desired threshold with respect to the minimal singular value is ensured. Moreover, we derive necessary and sufficient conditions of optimality for maximizing the minimal singular value. Further, we demonstrate the usefulness of the proposed approach using bilinear EDMD with control for nonholonomic robots.
Secure Two-Party Matrix Multiplication from Lattices and Its Application to Encrypted Control
In this study, we propose a two-party computation protocol for approximate matrix multiplication of fixed-point numbers. The proposed protocol is provably secure under standard lattice-based cryptographic assumptions and enables matrix multiplication at a desired approximation level within a single round of communication. We demonstrate the feasibility of the protocol by applying it to the secure implementation of a linear control law. Our evaluation reveals that the client achieves lower online computational complexity compared to the original controller computation, while ensuring the privacy of controller inputs, outputs, and parameters. Furthermore, a numerical example confirms that the proposed method maintains sufficient precision of control inputs even in the presence of approximation and quantization errors.
comment: 6 pages, 2 figures
Data-Efficient Control of Polynomial Systems via Physics-Guided Quadratic Constraints
This work addresses the critical challenge of guaranteeing safety for complex dynamical systems where precise mathematical models are uncertain and data measurements are corrupted by noise. We develop a physics-guided, direct data-driven framework for synthesizing robust safety controllers for discrete-time nonlinear polynomial systems that are subject to unknown-but-bounded disturbances. To do so, we introduce a notion of safety through robust control barrier certificates, which ensure avoidance of unsafe regions, offering a less conservative alternative to existing methods based on robust invariant sets. To achieve data efficiency, we further integrate physical information, formulated as quadratic constraints on system and control matrices, with observed noisy data. This integration drastically reduces data requirements, enabling robust safety analysis with significantly shorter trajectories compared to purely data-driven methods. The proposed synthesis procedure is formulated as a sum-of-squares optimization program that systematically designs the barrier and its associated controller by leveraging both collected data and underlying physical laws. The efficacy of our framework is demonstrated on three benchmark systems, confirming its ability to offer robust safety guarantees with reduced data demands.
A Tutorial on Learning-Based Radio Map Construction: Data, Paradigms, and Physics-Awareness
Radio maps (RMs) provide the digital representation of the wireless propagation environment, mapping complex geographical and topological boundary conditions to critical spatial-spectral metrics that range from received signal strength to full channel state information matrices. The integration of artificial intelligence into next generation wireless networks further necessitates the accurate construction of RMs as a foundational prerequisite for electromagnetic digital twins. This paper presents a comprehensive survey of learning-based RM construction, systematically addressing three intertwined dimensions: data, paradigms, and physics-awareness. From the data perspective, we review physical measurement campaigns, ray tracing simulation engines, and publicly available benchmark datasets, identifying their respective strengths and fundamental limitations. From the paradigm perspective, we establish a core taxonomy that categorizes RM construction into source-aware forward prediction and source agnostic inverse reconstruction, and examine five principal neural architecture families spanning convolutional neural networks, vision transformers, graph neural networks, generative adversarial networks, and diffusion models. We further survey optics-inspired methods adapted from neural radiance fields and 3D Gaussian splatting for continuous wireless radiation field modeling. From the physics-awareness perspective, we introduce a three-level integration framework encompassing data-level feature engineering, loss-level partial differential equation regularization, and architecture level structural isomorphism. Open challenges including foundation model development, physical hallucination detection, and mortized inference for real-time deployment are discussed to outline future research directions. The project page is at https://github.com/UNIC-Lab/Awesome-Radio-Map-Categorized.
Statistical Guarantees in Data-Driven Nonlinear Control: Conformal Robustness for Stability and Safety
We present a true-dynamics-agnostic, statistically rigorous framework for establishing exponential stability and safety guarantees of closed-loop, data-driven nonlinear control. Central to our approach is the novel concept of conformal robustness, which robustifies the Lyapunov and zeroing barrier certificates of data-driven dynamical systems against model prediction uncertainties using conformal prediction. It quantifies these uncertainties by leveraging rank statistics of prediction scores over system trajectories, without assuming any specific underlying structure of the prediction model or distribution of the uncertainties. With the quantified uncertainty information, we further construct the conformally robust control Lyapunov function (CR-CLF) and control barrier function (CR-CBF), data-driven counterparts of the CLF and CBF, for fully data-driven control with statistical guarantees of finite-horizon exponential stability and safety. The performance of the proposed concept is validated in numerical simulations with four benchmark nonlinear control problems.
Robustness of Incentive Mechanisms Against System Misspecification in Congestion Games
To steer the behavior of selfish, resource-sharing agents in a socio-technical system towards the direction of higher efficiency, the system designer requires accurate models of both agent behaviors and the underlying system infrastructure. For instance, traffic controllers often use road latency models to design tolls whose deployment can effectively mitigate traffic congestion. However, misspecifications of system parameters may restrict a system designer's ability to influence collective agent behavior toward efficient outcomes. In this work, we study the impact of system misspecifications on toll design for atomic congestion games. We prove that tolls designed under sufficiently minor system misspecifications, when deployed, do not introduce new Nash equilibria in atomic congestion games compared to tolls designed in the noise-free setting, implying a form of local robustness. We then upper bound the degree to which the worst-case equilibrium system performance could decrease when tolls designed under a given level of system misspecification are deployed. We validate our theoretical results via Monte-Carlo simulations as well as realizations of our worst-case guarantees.
comment: 6 pages, 3 figures
Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling
In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.
comment: 15 pages, accepted to Robotics: Science and Systems (RSS) 2026
An Asynchronous Two-Speed Kalman Filter for Real-Time UUV Cooperative Navigation Under Acoustic Delays
In Global Navigation Satellite System (GNSS)-denied underwater environments, individual unmanned underwater vehicles (UUVs) suffer from unbounded dead-reckoning drift, making collaborative navigation (CN) crucial for accurate state estimation. However, the severe communication delay inherent in underwater acoustic channels poses serious challenges to real-time state estimation. Traditional filters, such as Extended Kalman Filters (EKFs) or Unscented Kalman Filters (UKFs), usually block the main control loop while waiting for delayed data, or effectively discard Out-of-Sequence Measurements (OOSMs), resulting in serious drift. To address this, we propose an Asynchronous Two-Speed Kalman Filter (TSKF) enhanced by a novel projection mechanism, which we term Variational History Distillation (VHD). The proposed architecture decouples the estimation process into two parallel threads: a fast-rate thread that utilizes Gaussian Process (GP) compensated dead reckoning to guarantee high-frequency real-time control, and a slow-rate thread dedicated to processing asynchronously delayed collaborative information. By introducing a Finite-Length Circular State Buffer (FLCSB), the algorithm applies delayed measurements to their corresponding historical states, and utilizes a VHD-based projection to fast-forward the correction to the current time without computationally heavy recalculations. Simulation results demonstrate that the proposed TSKF maintains a trajectory error comparable to computationally intensive batch-optimization methods under severe delays (up to 30\,s). Executing in sub-millisecond time, it significantly outperforms standard EKF/UKF. The results demonstrate an effective control, communication, and computing (3C) co-design that significantly enhances the resilience of autonomous marine automation systems.
comment: 6 pages, 6 figures. Accepted for publication in the 2026 IEEE International Conference on Industrial Informatics (INDIN). \c{opyright} 2026 IEEE. Personal use of this material is permitted. See PDF for the full IEEE copyright notice
LC-SAC: Lyapunov-Constrained Soft Actor-Critic via Koopman Operator Theory for Trajectory Tracking and Stabilization
Reinforcement Learning (RL) has achieved remarkable success in solving complex sequential decision-making problems. However, its application to safety-critical physical systems remains constrained by the lack of stability guarantees. Standard RL algorithms prioritize reward maximization, often yielding policies that may induce oscillations or unbounded state divergence. In this work we propose a Lyapunov-Constrained Soft Actor-Critic (LC-SAC) algorithm using Koopman operator theory. We learn a linear lifted surrogate of the error dynamics via Extended Dynamic Mode Decomposition (EDMD) and solve the Discrete Algebraic Riccati Equation (DARE) to obtain a closed-form quadratic candidate Control Lyapunov Function (CLF). This CLF is incorporated into the SAC actor update as a Lagrangian penalty that aggregates the worst-case tail of violations via a Conditional Value-at-Risk (CVaR) objective, concentrating constraint pressure on rare but severe instability events. We further introduce three structural EDMD refinements spectral-radius normalization of the lifted A-matrix prior to the DARE solve, a physically meaningful LQR state cost, and a value-bias anchor enforcing V(0)=0 that make the closed-form CLF well-posed for higher-dimensional lifted models such as the cartpole and 3D quadrotor. The ablation study shows that a hard Lagrangian constraint is essential, replacing it with reward shaping (Lyap-RS-SAC) destabilizes learning and collapses return on quadrotor tasks.
comment: 13 pages, 8 Figures
Impulse-to-Peak-Output Norm Optimal State-Feedback Control of Linear PDEs
Impulse-to-peak response (I2P) analysis for state-space ordinary differential equation (ODE) systems is a well-studied classical problem. However, the techniques employed for I2P optimal control of ODEs have not been extended to partial differential equation (PDE) systems due to the lack of a universal transfer function and state-space representation. Recently, however, partial integral equation (PIE) representation was proposed as the desired state-space representation of a PDE, and Lyapunov stability theory was used to solve various control problems, such as stability and optimal ${H}_\infty$ control. In this work, we utilize this PIE framework, and associated Lyapunov techniques, to formulate the I2P response analysis problem as a solvable convex optimization and obtain provable bounds for the I2P-norm of linear PDEs. Moreover, by establishing strong duality between primal and dual formulations of the optimization problem, we develop a constructive method for I2P optimal state-feedback control of PDEs and demonstrate the effectiveness of the method on various examples.
comment: This paper has been submitted to IEEE-LCSS and IEEE CDC 2026 for review. The LA-UR is the evidence that this document has been approved for unlimited release by LANL
Polynomial Constraints for Robustness Analysis of Nonlinear Systems
This paper presents a framework for abstracting uncertain or non-polynomial components of dynamical systems using polynomial constraints. This enables the application of polynomial-based analysis tools, such as sum-of-squares programming, to a broader class of non-polynomial systems. A numerical method for constructing these constraints is proposed. The relationship between polynomial constraints and existing integral quadratic constraints (IQCs) is investigated, providing transformations of IQCs into polynomial constraints. The effectiveness of polynomial constraints in characterizing nonlinearities is validated via numerical examples to compute inner estimates of the region of attraction for two systems.
Dynamic Gradient-Based Calibration for Robust and Accurate Traffic Macrosimulation SC
Robust and accurate calibration of macroscopic traffic flow models such as METANET is critical for reliable prediction and effective control. While gradient-based methods are desirable for high-dimensional parameter spaces, their application to real-world traffic scenarios is hindered by highly nonconvex optimization landscapes. Consequently, standard static calibration frequently yields parameter sets that produce unstable, unrealistic traffic dynamics, undermining confidence in the estimated parameters and compromising the simulation's utility for counterfactual scenario testing. To address this, we propose a dynamic, rolling-horizon calibration framework. By reformulating static one-time estimation into a dynamic control problem, parameters better maintain stability and accuracy amid measurement noise. Using real-world data from the I-24 MOTION testbed, this work empirically characterizes the instability of standard methods. It then shows that the proposed approach simultaneously enhances robustness to perturbations and achieves a 48% improvement in predictive accuracy over conventional static calibration.
comment: Accepted to IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026
A fast reduced order method for linear parabolic inverse source problems
In this paper, we propose a novel, computationally efficient reduced order method to solve linear parabolic inverse source problems. Our approach provides accurate numerical solutions without relying on specific training data. The forward solution is constructed using a Krylov sequence, while the source term is recovered via the conjugate gradient (CG) method. Under a weak regularity assumption on the solution of the parabolic partial differential equations (PDEs), we establish convergence of the forward solution and provide a rigorous error estimate for our method. Numerical results demonstrate that our approach offers substantial computational savings compared to the traditional finite element method (FEM) and retains equivalent accuracy.
comment: Unfinished work
Credit-Based vs. Discount-Based Congestion Pricing: A Comparison Study
Credit-based congestion pricing (CBCP) and discount-based congestion pricing (DBCP), which respectively allot travel credits and toll discounts to subsidize low-income users' access to tolled roads, have emerged as promising policies for alleviating the societal inequity concerns of congestion pricing. However, since real-world implementations of CBCP and DBCP are nascent, their relative merits remain unclear. In this work, we compare the efficacy of deploying CBCP and DBCP in reducing user costs and increasing toll revenues. We first formulate a non-atomic congestion game in which low-income users receive a travel credit or toll discount for accessing tolled lanes. We establish that, in our formulation, Nash equilibrium flows always exist and can be computed or well approximated via convex programming. Our main result establishes a set of practically relevant conditions under which DBCP provably outperforms CBCP in inducing equilibrium outcomes that minimize a given societal cost, which encodes user cost reduction and toll revenue maximization. Finally, we validate our theoretical contributions via a case study of the 101 Express Lanes Project, a CBCP program implemented in the San Francisco Bay Area.
On Signed Network Games with Binary Actions
We study binary-action pairwise-separable graphical games that encompass both coordination and anti-coordination network games. Our model is grounded in an underlying directed signed graph, where each link is associated with a signed weight that describes both nature and the strength of the strategic pairwise interaction. Specifically, positive link weight corresponds to a strategic complement type interaction, whereas negative link weight corresponds to strategic substitute type interaction. The utility for each player is then an aggregation of pairwise terms determined by the weights of the signed graph in addition to an individual bias term. We consider a scenario that assumes the presence of a prominent cohesive subset of players, who are either connected exclusively by positive weights, or form a structurally balanced subset that can be bipartitioned into two adversarial subcommunities with positive intra-community and negative inter-community edges. Under suitable properties of the game restricted to the remaining players, our results guarantee the existence of Nash equilibria characterized by either consensus or polarization within the first group, as well as their stability under best response transitions. Our results can be interpreted as robustness results, building on the super-modular properties of network coordination games and on a novel use of the concept of graph cohesiveness.
comment: 15 pages, 8 figures, 1 table
Robust $\mathcal{H}_\infty$ Observer Design via Finsler's Lemma and IQCs
This paper develops a Finsler-based LMI for robust $\mathcal{H}_\infty$ observer design with integral quadratic constraints (IQCs) and block-structured uncertainty. By introducing a slack variable that relaxes the coupling between the Lyapunov matrix, the observer gain, and the IQC multiplier, the formulation addresses two limitations of the standard block-diagonal approach: the LMI requirement $\mathrm{He}(PA) \prec 0$ (which fails for marginally stable dynamics), and a multiplier--Lyapunov trade-off that causes infeasibility for wide uncertainty ranges. For marginally stable dynamics, artificial damping in the design model balances certified versus actual performance. The framework is demonstrated on quaternion attitude estimation with angular velocity uncertainty and mass-spring-damper state estimation with uncertain physical parameters.
A Lyapunov-Based Small-Gain Theorem for Fixed-Time Stability
This paper introduces a novel Lyapunov-based small-gain methodology for establishing fixed-time stability (FxTS) guarantees in interconnected dynamical systems. Specifically, we consider interconnections in which each subsystem admits an individual fixed-time input-to-state stability (ISS) Lyapunov function that certifies FxT-ISS. We then show that if a nonlinear small-gain condition is satisfied, then the entire interconnected system is FxTS. Our results are analogous to existing Lyapunov-based small-gain theorems developed for asymptotic and finite-time stability, thereby filling an important gap in the stability analysis of interconnected dynamical systems. The proposed theoretical tools are further illustrated through analytical and numerical examples, including the first result on fixed-time feedback optimization of dynamical systems without time-scale separation between the plant and the controller.
BERT4beam: Large AI Model Enabled Generalized Beamforming Optimization
Artificial intelligence (AI) is anticipated to emerge as a pivotal enabler for the forthcoming sixth-generation (6G) wireless communication systems. However, current research efforts regarding large AI models for wireless communications primarily focus on fine-tuning pre-trained large language models (LLMs) for specific tasks. This paper investigates the large-scale AI model designed for beamforming optimization to adapt and generalize to diverse tasks defined by system utilities and scales. We propose a novel framework based on bidirectional encoder representations from transformers (BERT), termed BERT4beam. We aim to formulate the beamforming optimization problem as a token-level sequence learning task, perform tokenization of the channel state information, construct the BERT model, and conduct task-specific pre-training and fine-tuning strategies. Based on the framework, we propose two BERT-based approaches for single-task and multi-task beamforming optimization, respectively. Both approaches are generalizable for varying user scales. Moreover, the former can adapt to varying system utilities and antenna configurations by re-configuring the input and output module of the BERT model, while the latter, termed UBERT, can directly generalize to diverse tasks, due to a finer-grained tokenization strategy. Extensive simulation results demonstrate that the two proposed approaches can achieve near-optimal performance and outperform existing AI models across various beamforming optimization tasks, showcasing strong adaptability and generalizability.
Robotics
Global Convergence of a Line-Search Filter Differential Dynamic Programming Method
In this article, we establish the global convergence properties of the FilterDDP algorithm, which extends the discrete-time differential dynamic programming (DDP) algorithm of Mayne and Jacobson [\emph{International Journal of Control}, 3, (1966), pp. 85-95] to handle nonlinear constraints over states and controls, in addition to the dynamics. FilterDDP adopts a line-search filter procedure for step acceptance. However, instead of a damped Newton step applied in the general nonlinear programming setting, the computation of a trial point involves applying a backward recursion and a forward simulation. We establish the global convergence of FilterDDP by showing that for a subset of constrained optimal control problems, the this backward-forward procedure satisfies the same properties as a Newton step for the purpose of establishing global convergence of a line-search filter method, following the analysis of Wächter and Biegler [\emph{SIAM Journal on Optimization}, 16 (2005), pp. 1-31].
Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX
High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.
LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World
Training vision-language-action (VLA) policies for humanoid loco-manipulation is constrained by the high cost and complexity of collecting human teleoperation demonstrations. VLA policies fine-tuned in simulators have, until now, failed to transfer effectively in humanoid loco-manipulation tasks. We present LEGS (Loco-manipulation via Embodied Gaussian Splatting), a hybrid simulator that composites a mesh foreground (robot, objects, props) over a photorealistic 3D Gaussian Splatting (3DGS) background reconstructed from a handheld scene capture. LEGS uses a procedural motion-primitive generator to synthesize labeled demonstrations at scale without human teleoperation, and a deterministic two-stage color calibration to align the rendered 3DGS image to the robot's deployment camera. On a Unitree G1 humanoid robot, across three pick-and-place tasks of increasing whole-body difficulty and three VLA backbones (psi_0, pi_0.5, GR00T N1.6), a policy trained purely on LEGS data matches or exceeds one trained on human teleoperation demos on every experiment. It also outperforms a mesh-only simulation baseline that ablates the effect of the 3DGS background, showing that photorealistic rendering is a key enabler for synthetic data transfer. Humanoid motion is recorded independently of scene appearance in LEGS, allowing the same auto-generated demonstrations to be re-rendered under new backgrounds and object meshes--covering a new scene at more than 15x lower cost than teleoperation--to augment training data for robustness to scene variations. Under combined object-and-scene appearance shift, the policy trained on re-rendered LEGS-AUG data maintains task success while the baseline trained on teleoperation data fails entirely. Our project page is located at https://legsvla.github.io/.
comment: https://legsvla.github.io/
A Sonar-Visual Dataset for Cross-Modal Underwater Robot Perception ICRA 2026
Underwater robots typically use both cameras and sonar for perception to leverage the rich semantic details of vision and the robust range measurements of acoustics. However, learning to map between these modalities via cross-modal prediction remains underexplored due to limited sonar-visual paired datasets. We present SOVIS, a sonar-visual dataset for cross-modal underwater perception. SOVIS comprises over 76,000 paired frames collected across 17 dives at six sites in the Trondheimfjord, supported by an end-to-end pipeline that cleans and synchronizes the cross-modal sensor data. We also introduce an interactive annotation tool designed to accelerate the labeling process for this paired data. Finally, we demonstrate a proof-of-concept cross-modal fish detection task using a small subset of labeled data, achieving a 7x improvement in mAP@0.10 over a monocular camera baseline. SOVIS serves as the first step toward advancing cross-modal underwater perception research, enabling research directions such as dense sonar prediction from monocular images.
comment: 6 pages, 7 figures, 3 tables. Accepted to IEEE ICRA 2026 S2S Workshop (From Sea to Space: Advancing Perception in Harsh Domains)
Autopilot-Preserving Residual Q-Learning with HJB-Inspired Finite-Action Risk Filtering for Fixed-Wing UAV Command Supervision
A fixed-wing UAV must hold airspeed, altitude, and heading references under wind, gusts, and turbulence, channels coupled so that correcting one can degrade another. Classical autopilots stabilize the airframe well but adapt poorly when a hard crosswind meets an aggressive turn, while reinforcement-learning (RL) policies acting directly on the surfaces concentrate exploration risk at the actuator interface. We place a learned supervisor above an unchanged autopilot rather than inside it: it selects a residual from a finite, bounded action set on the commanded airspeed, altitude, and heading; the modified reference is projected into an admissible command envelope before reaching the autopilot, which stays the only actuator-facing controller. What is new is how the residual is chosen. HJB residual scores candidates with a semi-discrete value-iteration critic in the spirit of the Hamilton-Jacobi-Bellman (HJB) equation, ranks them by a no-op-relative Hamiltonian advantage, and filters them through a control-Lyapunov- and control-barrier-inspired finite-action shield that always keeps a no-op fallback. On a shared 12-state runtime holding the plant, autopilot, and actuator model fixed, so the comparison is at the package level, HJB residual lowers mean RMS path-tracking error to 44.809 m, against 338.617 m for the baseline autopilot and 88.809 m for a tabular-Q residual, an 86.77% reduction over the baseline and 49.54% over Q-learning. The gain concentrates where the baseline fails worst and comes with a measured rise in airspeed error, so no method dominates every metric. We present this autopilot-preserving residual command-supervision design and benchmark with its trade-offs reported intact.
comment: 47 pages, 12 figures, 20 tables. Simulation-based study with a code-traceable benchmark, source code and a demonstration video are linked in the paper
ActMVS: Active Scene Reconstruction with Monocular Multi-View Stereo ICRA 2026
Active scene reconstruction enables robots/UAVs to autonomously plan trajectories and reconstruct environments without costly manual data acquisition. Unlike passive methods, active reconstruction requires real-time construction of high-confidence occupancy maps for collision-free navigation. Existing approaches rely on depth sensors for occupancy map updates, increasing platform cost and weight. To advance spatial intelligence, we aim for a vision-only monocular solution. However, current monocular scene reconstruction methods operate offline and fail to deliver globally consistent dense depth at the frame rates required for robots/UAVs navigation. To bridge this gap, we introduce ActMVS, the first framework for monocular active reconstruction. Our framework integrates a view factor graph construction for informed Multi-View Stereo depth prediction, along with a global depth optimization, to enable the online generation of high-quality, globally consistent dense depth maps. This enables monocular robots/UAVs to maintain reliable occupancy maps for safe trajectory planning during reconstruction. Experiments on Replica datasets demonstrate performance competitive with RGB-D methods. Our code and data are available at https://github.com/TrickyGo/ActMVS.
comment: ICRA 2026
S2M-Trek: From Single to Multi-Sphere Transport via Per-Frame Deep Sets on a Wheel-Legged Robot
We study the problem of scaling dynamic loco-manipulation from a single free-rolling sphere to multiple spheres transported simultaneously on the back of a wheel-legged quadruped, without fences, grippers, or mechanical stops. Multiple identical free-rolling spheres form an unordered set with no persistent identity: their ordering may change independently at each history frame, creating a \emph{per-frame permutation symmetry} that standard history-concatenation set encoders do not explicitly enforce -- these encoders impose only a shared, diagonal permutation symmetry over the full history. We show that this symmetry mismatch leads to a concrete failure mode in curriculum-based reinforcement learning. Within the same PPO training budget, flat MLPs and branch-wise encoders plateau at or below the two-sphere stage, while a history-concatenation Deep Sets baseline (\HCDS) fails to progress past the two-sphere stage in our runs unless ball-to-slot assignments are randomised during training, suggesting that it exploits slot indices as a curriculum shortcut rather than learning identity-free multi-sphere dynamics. We propose \textbf{Per-Frame Deep Sets (\PFDS)}, which performs permutation-invariant pooling within each history frame before temporal readout; we prove that \PFDS is $\Gframe$-invariant and universally approximates continuous $\Gframe$-invariant policies. A $2{\times}2$ ablation over encoder architecture and slot randomisation separates the architectural and data-augmentation pathways, and \PFDS reaches the five-sphere stage with 100\% no-drop transport in simulation across all five random seeds. We further distill the \PFDS teacher into \TactSet via DAgger, replacing privileged sphere-state observations with a $16{\times}16$ Boolean union contact map, yielding a compact and naturally $\Gframe$-invariant tactile representation.
PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making ICML 2026
Open-vocabulary navigation requires embodied agents to manage significant perception uncertainty stemming from semantic ambiguity and model errors. However, most existing works settle for local optimal deterministic approaches, depriving complex navigation decision-making over multiple composite possibilities that are critical for globally better solutions. In this paper, we propose Probabilistic Scene Graph Navigation (PSG-Nav), which constructs a 3D Probabilistic Scene Graph that uses full semantic categorical distributions to account for perception uncertainty. To efficiently use the local distributions to compose and reason about the optimal navigation landmarks, we propose Multiverse Decision to sample multiple most likely world settings from the joint distribution, and evaluate navigation landmarks based on the compatibility between landmarks and multiverses. To mitigate false positives due to epistemic uncertainty in open-vocabulary navigation, we introduce the Evidential Experience Calibrator, which enables online lifelong adaptation by cross-validating detections against memories of past successes and failures. Extensive experiments on widely-used benchmarks MP3D, HM3D, and HSSD demonstrate that PSG-Nav establishes new state-of-the-art results, achieving Success Rates of 66.1%, 44.8%, and 67.9%, respectively. Code is available at: https://psg-nav.github.io/
comment: 21 pages, 7 figures. ICML 2026
DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance
Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, specifically sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that synergizes the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to instantaneously prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Due to the severe physical risks associated with live testing of these sudden crossing scenarios, the framework is rigorously evaluated offline using a custom multi-modal dataset collected across both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the codes to our GitHub repo at https://github.com/oskarnatan/DeepIPCv3.
OneVLA: A Unified Framework for Embodied Tasks
Navigation and manipulation are fundamental capabilities of embodied intelligence, enabling robots to interpret natural language commands and interact physically with their surroundings. However, current Vision-Language-Action (VLA) models remain constrained by task-specific architectures, specializing in either navigation or manipulation, which hinders the development of general-purpose robotic agents. To bridge this gap, we introduce OneVLA, a unified architecture that integrates these distinct tasks into a single, cohesive framework. Specifically, we design a unified action head capable of generating both navigation and manipulation actions without requiring task-specific variants. Furthermore, we propose a multi stage progressive training strategy-incorporating curated data construction and Chain-of-Thought (CoT) fine-tuning that facilitates strong positive transfer and mutual reinforcement between the two domains. Extensive experiments in both simulated and real-world environments demonstrate that OneVLA achieves state-of-the-art performance, significantly outperforming both specialized single-task and existing cross-task models. By unifying these core capabilities, OneVLA paves the way for truly general-purpose robotic systems. The model and source code will be publicly released.
Training-Free Imitation Learning with Closed-Form Diffusion Policies
While diffusion-based policies have impressive performance and expressivity, their long offline training slows down the data collection and policy deployment loop. We introduce Closed-Form Diffusion Policies, a class of training-free diffusion-based policies for imitation learning using the closed-form score derived from the demonstration dataset. We deploy CFDP with real-time inference with a mobile CPU in hardware experiments, showing it can successfully perform imitation directly from the dataset in milliseconds and with faster inference than neural diffusion policies. In experiments on imitation learning benchmarks, we show that CFDP is competitive against neural baselines that require hours of training, providing a favorable tradeoff between training time and performance. Finally, we show how closed-form diffusion policies act as a composable primitive that enables data-driven inference-time editing of pre-trained neural diffusion policies, including policy guidance and novel demonstration augmentation.
ImagineUAV: Aerial Vision-Language Navigation via World-Action Modeling and Kinodynamic Planning
Vision-language navigation (VLN) for UAVs demands grounding free-form instructions into 6-DoF flight under partial observability. While Vision-Language-Action (VLA) models excel at semantic reasoning, they suffer from brittleness due to geometric inconsistency and dynamics mismatch. To address this, we propose ImagineUAV, an imagination-driven framework leveraging cascaded world-action modeling. Instead of direct regression, ImagineUAV employs a latent video diffusion model to generate instruction-conditioned future observations, explicitly imagining environmental evolution, from which 6-DoF motions are inferred via an action extractor. A kinodynamic planner then refines these estimates into collision-free trajectories. Additionally, a step-distilled inference pipeline ensures real-time execution. With only 1.3B parameters, ImagineUAV outperforms prior VLN and VLA baselines on benchmarks and real-world flights, validating the practicality of imagination-driven aerial navigation.
comment: Video demo: https://www.youtube.com/watch?v=Ng1alP0yhc0
Coordinating Task Switching in a Robotics Multi-Agent System Using Behavior Trees
The application of multi-agent systems in robotics is a very challenging field. Several competitions involving such systems are proposed to foster research and development of strategies and mechanisms using games as the underlying domain. Among them are the ones from the \textit{IEEE Very Small Soccer (VSSS)} category, which is the case study described in this paper. In VSSS, two teams of three robots each compete in a very dynamic environment of a soccer game. Thus, coordination of robots' behavior during the game is crucial to win it. In this paper, we present a Behavior-Tree-based approach to support multi-robot coordination within the VSSS team of the ThundeRatz robotics team from the Universidade de S$\tilde{a}$o Paulo. Moreover, a comparison between the proposed approach and the previous one, which was based on a Finite State Machine (FSM), was conducted using the FIRASim simulator. Besides that, the performance of this new strategy was further evaluated in an academic robotics competition.
comment: 7 pages, 7 figures. Preprint of a manuscript submitted to the XXVI Congresso Brasileiro de Automática (CBA 2026)
Time-Optimal Collision Avoidance Via a Greedy Polynomial Backward Sweep
Spacecraft collision avoidance for low-thrust satellites often requires determining not only how to maneuver, but also how late a maneuver can begin while still ensuring safety. This paper presents a greedy time-optimal (GTO) backward-sweep method to find the latest maneuver initiation time. The method starts from the nominal time of closest approach and iteratively propagates the maneuver backward in time, selecting at each step the thrust direction that locally minimizes the chosen danger metric. Differential algebra is used to efficiently propagate state sensitivities and update the time of closest approach online. The method is tested on a large dataset of conjunctions, using both miss distance and probability of collision as safety metrics. The approach achieves accurate results and only a small loss of optimality relative to an optimal-control benchmark, while retaining runtimes suitable for on-board implementation.
Tether-Aware Dynamic Collision Avoidance for USV-HROV Systems
Heterogeneous marine robotic systems composed of an unmanned surface vehicle (USV) and a hybrid remotely operated vehicle (HROV) have shown great potential for subsea cable inspection. In such missions, the USV tracks the HROV at the surface while supplying power and communication through an umbilical tether. However, dynamic collision avoidance for the USV during HROV tracking is challenging because the submerged tether may scrape against passing vessels, while evasive maneuvers can enlarge the USV--HROV separation, thereby increasing the likelihood of tether tautness and compromising HROV operations. To address these challenges, this work proposes a tether-aware dynamic collision avoidance method for a USV tracking an HROV. First, a tether safety-aware planar domain is introduced to represent the three-dimensional collision risk between the tether and obstacle vessels without an explicit tether shape model. Second, a tether tautness-aware velocity obstacle method is developed to achieve safe avoidance while reducing the likelihood of tether tautness. Finally, the method is integrated with line-of-sight guidance to coordinate HROV tracking and collision avoidance. Gazebo-based simulations show that the proposed method avoids dynamic obstacle vessels while maintaining tether safety and reducing the likelihood of tether tautness during USV evasive maneuvers.
Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry
Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for high-frequency robot control. While recent one-step formulations alleviate this latency, they inevitably discard the intermediate trajectory evolution that provides crucial action correction. Directly recovering this mechanism by explicitly estimating a training-time drifting field is mathematically ill-posed due to extreme conditional demonstration sparsity. We introduce Implicit Drifting Policy (IDP), a one-step imitation learning framework that brings the training-time correction of Drifting into policy learning without explicit vector field estimation. IDP extracts a conditional expert geometry from the local variation of observation-similar expert actions, and compares it against a global reference geometry to isolate condition-specific constraints. This local geometric structure adaptively weights a scalar potential objective. Combined with an expert-proximal terminal evaluation, IDP directly enforces manifold constraints on the one-step generator during training. Extensive evaluations across 2D, 3D, and real-world manipulation tasks show IDP effectively maintains adherence to valid action manifolds, improving upon explicit drifting methods and achieving competitive performance with strong one-step baselines.
Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA
Vision-language-action (VLA) policies and World-Action Models (WAM) represent two increasingly important paradigms for robotic manipulation. However, it remains unclear whether future prediction in WAMs leads to behaviorally meaningful improvements beyond final task success. In this paper, we ask whether WAMs merely add future prediction, or whether they change robot behavior and internal representations in ways that are actionable for control. We introduce a model-agnostic diagnostic framework that compares WAMs and VLAs through two complementary lenses: behavioral rollout analysis and sparse-autoencoder-based feature analysis. The behavioral protocol measures action dynamics consistency, target-object progress, distractor disturbance, and runtime cost. The feature-space protocol characterizes internal representations as memorized, reactive, or predictive, revealing whether models encode future-oriented structure. Across LIBERO and RoboTwin2.0, we evaluate 7 policies spanning direct VLAs and joint, sequential, and auxiliary WAMs. Our results show that success alone hides key differences: WAMs often improve object-level behavior and target selectivity, but their gains depend on architecture and incur higher inference cost. Sequential WAMs show the clearest predictive structure, while auxiliary and joint WAMs respectively compress or entangle future information. These findings suggest future directions for WAMs design to preserve behaviorally actionable future representations for efficient manipulation.
Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs
Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.
Learning Multi-Modal Trajectory Policies for Data-Efficient Robotic Manipulation
Robotic manipulation requires the effective integration of heterogeneous inputs, including visual observations, language instructions, and trajectory representations, to generate accurate actions. Existing transformer-based policies typically process these heterogeneous modalities within a shared parameter space, which often leads to modality interference and inefficient representation learning, especially in data-scarce scenarios. While Mixture-of-Experts (MoE) offers a scalable solution through expert specialization, conventional routing mechanisms are often sensitive to such cross-modal representation discrepancies, resulting in unstable expert assignment and expert collapse. In this work, we propose MATE (Multi-ModAl TrajEctory Policies), a novel trajectory prediction framework built upon MoE. Specifically, we introduce a Multi-Modal MoE architecture to achieve fine-grained sub-token feature decoupling, and design a cross-modal cosine router for stable and scale-invariant expert assignment across heterogeneous modalities. We further employ temperature-controlled routing and stochastic noise injection to improve expert balance and prevent premature routing collapse under scarce demonstrations. Experiments on the LIBERO benchmark show that our MATE consistently outperforms prior work under data scarcity. It achieves a 4.75% improvement in average success rate over the trajectory-guided counterpart. Real-world experiments on robotic ping-pong also suggest that the predicted trajectories can provide useful guidance for downstream robotic execution, further indicating the practical feasibility of our algorithm.
Robust Integrated Planning and Control for Quadrotors in Dynamic Environments via NMPC with CBF Penalties
This paper presents a new robust integrated planning and control (IPC) strategy for multirotor uncrewed aerial vehicles. We propose a nonlinear model predictive control (NMPC) formulation that embeds control barrier functions (CBFs) as exponential penalties, improving feasibility while ensuring smooth obstacle avoidance under tight input bounds. The penalty weights provide a practical tuning knob to trade off tracking accuracy against avoidance aggressiveness. We enhance the system robustness by employing a high-gain disturbance observer (HGDO) to estimate and compensate for external disturbances. We also incorporate a Kalman filter (KF) for computationally efficient, real-time prediction of obstacle motion, enabling avoidance of moving obstacles. Comparative studies against both conventional NMPC and NMPC with hard CBF constraints, validated in Gazebo and hardware experiments, demonstrate superior feasibility, safety, and robustness. To the best of our knowledge, this is the first hardware-validated NMPC-CBF IPC framework, offering a practical step toward safe quadrotor deployment in dynamic environments.
comment: Accepted to Conference on Robots and Vision (CRV 2026), Vancouver, Canada
Position: Good Embodied Reward Models Need Bad Behavior Data ICML 2026
This position paper argues that to obtain reliable embodied reward models, the community must invest in ``bad'' robot data: failed, suboptimal, error-prone, and even hazardous behaviors. While reward models are central to any foundation model's lifecycle, today's embodied reward models are trained primarily on successful behaviors. We analyze three state-of-the-art embodied reward models and find that they systematically over-reward behaviors that real human evaluators would penalize, including unsafe interactions, poor execution, and shortcut strategies that only superficially satisfy tasks. We attribute these failures to a key data gap: the scarcity of negative embodied data which is costly to collect and often filtered out or withheld in existing robotics datasets. Furthermore, we show that even modest exposure to real bad behavior data can improve alignment with human preferences and reduce costly false positives. We therefore call on the embodied AI community to curate and release their bad robot data, build synthetic bad data generation engines, develop more decentralized physical evaluation systems, and design benchmarks for fine-grained embodied reward model evaluations.
comment: This position paper has been accepted by the ICML 2026 position track as a spotlight paper
$τ_0$-WM: A Unified Video-Action World Model for Robotic Manipulation
Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present $τ_0$-World Model ($τ_0$-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, $τ_0$-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on approximately $27{,}300$ hours of real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories using modality-specific supervision masks. At inference time, $τ_0$-WM uses test-time computation to sample action candidates, rank them with re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, $τ_0$-WM shows superior performance over other relevant baselines.
comment: Our project homepge: https://finch.agibot.com/research/tau0-wm
AI-IoT-Robotics Integration: Survey of Frameworks, Emerging Trends, and the Path Toward Connected Robotics
The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming the foundation of real-time, intelligent, and context-aware systems. AI enables perception and reasoning, IoT provides scalable sensing and communication, and robotics delivers embodied actuation. Despite significant progress in pairwise combinations such as AIoT and the Internet of Robotic Things (IoRT), there remains a lack of unified design frameworks that fully integrate all three. This survey synthesizes the state-of-the-art across these domains, emphasizing the emerging role of Small Language Models (SLMs) at the edge and Large Language Models (LLMs) in the cloud for distributed cognition and autonomous decision-making. We propose a modular system architecture that aligns with these trends, analyze persistent gaps in interoperability and feedback control, and classify existing work by integration depth. Our review highlights how hybrid SLM-LLM systems, when coupled with IoT infrastructure and robotic agents, can address challenges in real-time adaptation, scalability, and reliability. This work offers a conceptual and technical roadmap for designing next-generation AI-IoT-Robotic ecosystems that are modular, interpretable, and capable of learning within dynamic environments, paving the way for the emerging paradigm of Connected Robotics and Physical AI.
comment: 15 pages, 3 figures, 3 tables. Published in IEEE Internet of Things Journal
GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
We study cross-embodiment 6-DOF robot grasping. Unlike prior works, we require the model not only to generalize to novel objects / scenes but also to novel gripper morphologies and physical grasping processes. Our method extends diffusion model based generative 6-DOF grasping models to condition on the additional gripper's representation. We propose a swept-volume heuristic for encoding the gripper. We train our cross-embodiment model with procedural grippers and a large-scale dataset of 2 Billion grasps. In simulation experiments, our model has the best zero-shot generalization to novel real-world grippers and objects over baseline methods. Our model also serves as a good initialization for fine-tuning to adapt to novel grippers. In ablations, we demonstrate the efficiency of our sweep-volume gripper representation and our procedural gripper training dataset. Last, we show zero-shot generalization to real-world novel grippers for 6-DOF grasping, surpassing baselines in cross-embodiment generalization.
OSCAR: Obstacle Survival Curves for Adaptive Robot Navigation
A mobile robot following a graph of known routes can make costly navigation errors when a temporary obstacle blocks a critical edge: waiting too long behind a parked cart wastes time, but immediately rerouting around a person who would move in a few seconds is also inefficient. Standard reactive obstacle avoidance addresses local motion around obstacles, while fixed wait-or-reroute rules ignore how long different obstacle types tend to persist. We propose OSCAR: an adaptive survival-modeling framework for graph-based navigation with temporary blockages. Assuming obstacle class labels are available at encounter time, the robot learns class-conditioned residual clearance-time distributions from online experience, including right-censored observations when it reroutes before observing clearance. These survival models are integrated into a time-dependent graph planner that maintains obstacle memory and computes a patience threshold at each blocked edge: how long to wait before taking an alternate route. The method continuously updates its clearance estimates across episodes and uses them to balance waiting against rerouting. We evaluate the approach in simulation and on a real mobile robot in a university atrium with obstacles including people, chairs, bins, and tubes. In simulation, the learned policy's time-to-goal converges to within 1% of an oracle with access to ground-truth clearance distributions after fewer than 20 observations per obstacle class, outperforming all heuristic baselines. Real-world deployment confirms that the policy improves online, adapting its patience thresholds from experience across 50 navigation episodes.
comment: 8 pages main text, appendices included
Make Your VLA More Robust Without More Data By Interleaving Motion Planning
Vision-Language-Action (VLA) models have shown remarkable progress for mobile manipulation, but their performance on long-horizon tasks remains poor. These tasks are especially challenging because (1) progress toward high-level goals must be maintained across extended sequences of spatially distributed subtasks, and (2) early execution errors compound rapidly over the task horizon. These challenges persist despite finetuning on large human teleoperated mobile manipulation data, indicating that more data alone may not resolve the problem. To address these challenges, we propose MPVI: Motion Planner / VLA Interleaving, a framework that integrates model-based motion planning with VLAs to improve robustness without further training. The proposed integration enables localization and navigation to distant or occluded target objects through cluttered scenes using open-vocabulary object detection, frontier exploration and motion planning. However, such integration is non-trivial, requiring reliable switching between modules; we show one way forward via VLM-based completion checking with proprioceptive triggers. We evaluate our approach on the BEHAVIOR-1K benchmark and demonstrate 113% improvement in task progress over a top end-to-end VLA baseline. Additional details are available at the project page: https://mpvi.netlify.app/.
Threading Optimization for Vision-Language-Action Model Inference in Low-Cost Smart Agricultural Manipulation
Vision-Language Action (VLA) models continue to face challenges such as slow inference speed and difficulty performing fine-grained motion adjustments, limiting their widespread adoption in industry. While the Real-Time Action Chunking (RTAC) algorithm has been proposed to address these bottlenecks, bridging the gap between the algorithm provided in pseudocode to a stable, real-world deployment on a low-cost robotic arm remains a challenge. In this work, we present a complete system-level implementation of RTAC tailored for a low-cost robotic manipulation system. We advance beyond the original high-level pseudocode by optimizing the threading implementation for the policy inference and control pipeline, reducing end-to-end latency and improving responsiveness without modifying the underlying policy. We evaluate this system on tasks involving the manipulation of agricultural produce, specifically garlic bulbs and walnuts. Experimental results demonstrate that our custom threading implementation significantly improves control stability and speed compared to the base implementation of RTAC.
Line-Search Filter Differential Dynamic Programming for Optimal Control with Nonlinear Equality Constraints ICRA
We present FilterDDP, a differential dynamic programming algorithm for solving discrete-time, optimal control problems (OCPs) with nonlinear equality constraints. Unlike prior methods based on merit functions or the augmented Lagrangian class of algorithms, FilterDDP uses a step filter in conjunction with a line search to handle equality constraints. We identify two important design choices for the step filter criteria which lead to robust numerical performance: 1) we use the Lagrangian instead of the cost in the step acceptance criterion and, 2) in the backward pass, we perturb the value function Hessian. Both choices are rigorously justified, for 2) in particular by a formal proof of local quadratic convergence. In addition to providing a primal-dual interior point extension for handling OCPs with both equality and inequality constraints, we validate FilterDDP on three contact implicit trajectory optimisation problems which arise in robotics.
comment: Accepted for publication in the IEEE International Conference on Robotics and Automation (ICRA) 2026. Revised version with more exposition in methodology and updated results with improved implementation
Sim-to-Real Transfer for Muscle-Actuated Robots via Generalized Actuator Networks
Tendon drives paired with soft muscle actuation enable faster and safer robots while potentially accelerating skill acquisition. Still, these systems are rarely used in practice due to inherent nonlinearities, friction, and hysteresis, which complicate modeling and control. So far, these challenges have hindered policy transfer from simulation to real systems. To bridge this gap, we propose a sim-to-real pipeline that learns a neural network model of this complex actuation and leverages established rigid body simulation for the arm dynamics and interactions with the environment. Our method, called Generalized Actuator Network (GenAN), enables actuation model identification across a wide range of robots by learning directly from joint position trajectories rather than requiring torque sensors. Using GenAN on PAMY2, a tendon-driven robot powered by pneumatic artificial muscles, we successfully deploy dynamic but precise goal-reaching, ball-in-a-cup, and table tennis policies, trained entirely in simulation. To the best of our knowledge, this result constitutes the first successful sim-to-real transfer for a four-degrees-of-freedom muscle-actuated robot arm.
LLM Trainer: Automated Robotic Data Generation via Demonstration Augmentation using LLMs ICRA 2026
We present LLM Trainer, a fully automated pipeline that leverages the world knowledge of Large Language Models (LLMs) to transform a small number of human demonstrations (as few as one) into a large robot dataset for imitation learning. Our approach decomposes demonstration generation into two steps: (1) offline demonstration annotation that extracts keyframes, salient objects, and pose-object relations; and (2) online keypose retargeting that adapts those keyframes to a new scene, given an initial observation. Using these modified keypoints, our system warps the original demonstration to generate a new trajectory, which is then executed, and the resulting demo, if successful, is saved. Because the annotation is reusable across scenes, we use Thompson sampling to optimize the annotation, significantly improving generation success rate. We evaluate our method on a range of tasks, and find that our data annotation method consistently outperforms expert-engineered baselines. We further show an ensemble policy that combines the optimized LLM feed-forward plan with a learned feedback imitation learning controller. Finally, we demonstrate hardware feasibility on a Franka Emika Panda robot. For additional materials and demonstration videos, please see the project website: https://sites.google.com/andrew.cmu.edu/llm-trainer
comment: 9 pages, 5 figures, 4 tables. Accepted in ICRA 2026
CrazyMARL: Decentralized Direct Motor Control Policies for Cooperative Aerial Transport of Cable-Suspended Payloads ICRA
Collaborative transportation of cable-suspended payloads by teams of UAVs has the potential to enhance payload capacity, adapt to different payload shapes, and provide built-in compliance, making it attractive for applications ranging from disaster relief to precision logistics. However, multi-UAV coordination under disturbances, nonlinear payload dynamics, and slack-taut cable modes remains a challenging control problem. To our knowledge, no prior work has addressed these cable mode transitions in the multi-UAV context, instead relying on simplifying rigid-link assumptions. We propose CrazyMARL, a decentralized RL framework for multi-UAV cable-suspended payload transport. Simulation results demonstrate that the learned policies can outperform classical decentralized controllers in terms of disturbance rejection and tracking precision, achieving an 80% recovery rate from harsh conditions compared to 44% for the baseline method. We also achieve successful zero-shot sim-to-real transfer and demonstrate that our policies are highly robust under harsh conditions, including wind, random external disturbances, and transitions between slack and taut cable dynamics. This work paves the way for autonomous, resilient UAV teams capable of executing complex payload missions in unstructured environments. Code and videos can be found on the website: https://imrclab.github.io/CrazyMARL.
comment: International Conference on Robotics and Automation (ICRA), 2026
HALO: Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization
To improve generalization and resilience in human-robot collaboration (HRC), robots must contend with diverse combinations of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG), where decentralized policy updates deviate from cooperative joint optimization. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALO), a framework that stabilizes decentralized MARL by enforcing Lyapunov-based contraction in policy-parameter space. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALO uses Lyapunov certification to stabilize decentralized policy learning. HALO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases. Our project website is available at https://HaoZhang-THU.github.io/HALO/.
comment: https://HaoZhang-THU.github.io/HALO/
FDIO: Frequency Decomposed Inertial Odometry
Pedestrian inertial odometry (PIO) estimates autonomous pedestrian motion using only acceleration and angular velocity measurements collected by an inertial measurement unit (IMU), making it highly valuable for consumer level localization applications. However, under a dual device acquisition setting, IMU signals collected by a freely carried mobile device are inherently composite signals in which the global motion of the human torso is coupled with perturbations induced by local limb motion. This coupling makes accurate human motion modeling more challenging. To address this issue, this paper proposes frequency decomposed inertial odometry (FDIO). The proposed method first decomposes input IMU signals into low frequency and high frequency components using a Laplacian pyramid. It then adopts a Mamba module to model long range motion information from the low frequency component and uses a multi scale convolution module to extract fine grained local dynamic features from the high frequency component. Experiments on five public PIO datasets show that FDIO achieves an average absolute trajectory error of 3.221~m and an average relative trajectory error of 2.550~m, reducing the errors by 33.3\% and 16.7\% compared with the RoNIN ResNet baseline, respectively. These results validate the effectiveness of the proposed frequency decomposition strategy. To the best of our knowledge, this work is among the first efforts to introduce Mamba and a frequency decomposition architecture into inertial odometry.
DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle
We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific control learning. Unlike prior camera-reliant models, DeepIPCv2 employs point cloud segmentation and multi-view projection to construct robust scene representations. These features are fused and decoded through a combination of gated recurrent units, command-specific multi-layer perceptrons, and PID controllers to estimate both waypoints and navigational control commands. This design enhances maneuverability and addresses action imbalance in driving datasets. To validate the model, we constructed a dataset covering diverse illumination conditions and conducted ablation studies and comparative tests against recent methods, including TransFuser. Results demonstrate that DeepIPCv2 achieves the lowest total metric error and the fewest driving interventions, highlighting both its robustness to illumination changes and its improved control accuracy. By releasing the codes at https://github.com/oskarnatan/DeepIPCv2 later, we aim to support reproducibility and future advancements in end-to-end autonomous driving research.
comment: This work has been accepted for publication in IEEE Access. https://ieeexplore.ieee.org/document/11313052
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies ICML 2026
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order with poor performance or attach separate diffusion heads outside the backbone that fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA that discretizes action chunks and models them with discrete diffusion pattern retaining progressive refinement inside the unified transformer backbone. Our method achieves an adaptive decoding order that resolves high-confidence action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves the efficiency. Discrete Diffusion VLA achieves 96.4% avg. success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% of parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating well retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on AgileX Cobot Magic platform to show the method's effectiveness.
comment: Accepted by ICML 2026. 17 pages
Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion
Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.
Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling
Safe and feasible trajectory planning is critical for real-world autonomous driving systems. However, existing learning-based planners rely heavily on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting undesirable behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning. In the first stage, a general trajectory predictor is pre-trained on expert data to capture diverse, human-like driving behaviors. In the second stage, the model is fine-tuned with rule-based rewards using Group Relative Policy Optimization (GRPO), explicitly aligning ego planning with principles such as safety, comfort, and traffic rule compliance. This two-stage paradigm retains human-like behaviors while enhancing safety awareness and discarding undesirable patterns from demonstrations. Furthermore, we identify a key limitation of directly applying GRPO to planning: group-wise normalization erases cross-group scale differences, causing rare, high-variance safety-violation groups to have similar advantages as abundant low-variance safe groups, thereby suppressing optimization for safety-critical objectives. To address this, we propose Variance-Decoupled GRPO (VD-GRPO), which replaces normalization with centering and fixed scaling to preserve absolute reward magnitudes, ensuring that safety-critical objectives remain dominant throughout training. Experiments on the nuPlan benchmark demonstrate that Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance, particularly in realistic reactive settings. Our code is available at https://github.com/XiaolongTang23/Plan-R1.
Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation
We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in real-world environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use a lightweight model as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead deriving global heading via differential analysis of sequential GNSS coordinates. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally-aware systems. To support future research, we will release the codes to our GitHub repo at https://github.com/oskarnatan/Seq-DeepIPC.
comment: This work has been accepted for publication in the IEEE Sensors Journal. https://ieeexplore.ieee.org/document/11373257/
Improving Diffusion Planners by Self-Supervised Action Gating with Energies
Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.
Contrastive Representation Regularization for Vision-Language-Action Models ICML 2026
Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address the issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the performance of state-of-the-art VLA models; it pushes the prior art to 69.7% achieving the state-of-the-art performance on the RoboCasa-Kitchen benchmark, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
comment: ICML 2026
DIPOLE: Fusing Vision and Geometry for Robust Visuomotor Generalization
Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods tend to struggle once test-time conditions differ from the demonstrations, such as changes in lighting, texture, viewpoint, object placement, or object identity. To address this challenge, we propose DIffusion POlicy with compLementarity Encoders (DIPOLE), a visuomotor policy that learns to fuse complementary modalities through a training-time mechanism rather than a specialized fusion architecture. A modality-wise dropout masks one branch at each training step, encouraging each modality to remain individually informative. A lightweight cross-attention layer then exchanges complementary cues between the two. This design endows DIPOLE with five core strengths: stable high performance across diverse tasks, robustness to visual changes, spatial generalization at sub-centimeter precision, emergent capability beyond either modality, and zero-shot transfer to unseen objects. Across 18 simulated and 4 real-world tasks, DIPOLE outperforms six baselines by 39.1% on average, with gains of 41.5% under unseen visual distractors and 15.2% under randomized object placement.
A Predictive Control Strategy to Offset-Point Tracking for Agricultural Mobile Robots
Robots are increasingly being deployed in agriculture to support sustainable practices and improve productivity. They offer strong potential to enable precise, efficient, and environmentally friendly operations. However, most existing path-following controllers focus solely on the robot's center of motion and neglect the spatial footprint and dynamics of attached implements. In practice, implements such as mechanical weeders or spring-tine cultivators are often large, rigidly mounted, and directly interacting with crops and soil; ignoring their position can degrade tracking performance and increase the risk of crop damage. To address this limitation, we propose a closed-form predictive control strategy extending the approach introduced in [1]. The method is developed specifically for Ackermann-type agricultural vehicles and explicitly models the implement as a rigid offset point, while accounting for lateral slip and lever-arm effects. The approach is benchmarked against state-of-the-art baseline controllers, including a reactive geometric method, a reactive backstepping method, and a model-based predictive scheme. Real-world agricultural experiments with two different implements show that the proposed method reduces the median tracking error by 24% to 56%, and decreases peak errors during curvature transitions by up to 70%. These improvements translate into enhanced operational safety, particularly in scenarios where the implement operates in close proximity to crop rows.
comment: Accepted in the journal IEEE Transaction on Field Robotics
MiNI-Q: A Miniature, Wire-Free Quadruped with Unbounded, Independently Actuated Leg Joints
Physical joint limits are common in legged robots and can restrict workspace, constrain gait design, and increase the risk of hardware damage. This paper introduces MiNI-Q^2, a miniature, wire-free quadruped robot with independently actuated, mechanically unbounded 2-DOF leg joints. We present the mechanical design, kinematic analysis, and experimental validation of the proposed robot. The leg mechanism enables both oscillatory gaits and rotary locomotion while allowing the robot to fold to a minimum height of 2.5 cm. Experimentally, MiNI-Q achieves speeds up to 0.46 m/s and demonstrates low-clearance crawling, stair climbing, inverted locomotion, jumping, and backflipping. The wire-free architecture extends our previous Q8bot design, improving assembly reliability at miniature scale. All mechanical and electrical design files are released open source to support reproducibility and further research.
comment: 7 pages, 11 figures. Submitted to the IEEE RAS Conference on Ubiquitous Robots (UR 2026)
SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models ACL
Vision-Language-Action (VLA) models are increasingly deployed in safety-critical robotic applications, yet their security vulnerabilities remain underexplored. We identify a fundamental security flaw in modern VLA systems: the combination of action chunking and delta pose representations creates an intra-chunk visual open-loop. This mechanism forces the robot to execute K-step action sequences, allowing per-step perturbations to accumulate through integration. We propose SILENTDRIFT, a stealthy black-box backdoor attack exploiting this vulnerability. Our method employs the Smootherstep function to construct perturbations with guaranteed C2 continuity, ensuring zero velocity and acceleration at trajectory boundaries to satisfy strict kinematic consistency constraints. Furthermore, our keyframe attack strategy selectively poisons only the critical approach phase, maximizing impact while minimizing trigger exposure. The resulting poisoned trajectories are visually indistinguishable from successful demonstrations. Evaluated on the LIBERO, SILENTDRIFT achieves a 93.2% Attack Success Rate with a poisoning rate under 2%, while maintaining a 95.3% Clean Task Success Rate.
comment: Accepted to ACL Findings 2026
SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning
Robotic policy learning for complex real-world manipulation tasks has seen rapid recent progress, enabled in large part by the ability to collect demonstrations through human operation. However, policies trained from such demonstrations often execute tasks far more slowly than the robot's physical capabilities, as demonstration data is collected under practical constraints that favor conservative, success-oriented trajectories over execution speed. Existing policy acceleration methods determine execution tempo through data preprocessing or heuristic rules, rather than learning execution speed optimized for the task. In this paper, we propose SpeedAug, a policy acceleration framework that enables policies to learn task-optimal execution tempo via reinforcement learning (RL). SpeedAug first learns a tempo-enriched prior policy from speed-augmented demonstrations that captures diverse execution tempos. Building on this tempo-enriched prior, RL fine-tuning guides exploration to refine action trajectories and optimize execution tempo efficiently. Experiments on robotic manipulation benchmarks demonstrate that SpeedAug substantially improves the sample efficiency of policy acceleration while maintaining high success rates, achieving fast and stable task execution. Applied to a real-world manipulation task, SpeedAug improves task throughput by 1.8x using only 16 minutes of online interactions without compromising the success rate.
PLanAR: Planning-Language-Grounded Agentic Reasoning for Robot Manipulation
Recent advances in vision-language models (VLMs) have enabled increasing progress in real-world robot manipulation. However, long-horizon manipulation in unstructured environments requires VLMs to reason about changing scene states, action constraints, and execution outcomes, which remains difficult with natural language reasoning alone. We present PLanAR, a planning-language-grounded robot agent framework for open-vocabulary, long-horizon manipulation. PLanAR uses a planning-language interface to define the VLM reasoning space: object predicates represent scene states, action schemas specify robot skills with preconditions and effects, and symbolic plans provide executable intermediate representations. This interface enables stepwise verification: after each action, PLanAR uses onboard observations to check whether the expected symbolic effects have been achieved, allowing the VLM-based agent to update task states, detect failures, and replan when execution deviates from expectation. Across robot embodiments, VLM backends, and tasks including stacking, crossword solving, and long-horizon kitchen workflows, PLanAR demonstrates strong real-world capability while revealing key limitations of current VLMs in embodied reasoning.
comment: New version with updated framing, contributions, experiments, and figures
Multiagent Systems
LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies
We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics), we conducted 520 experimental runs across 8 design tasks of varying complexity, with 5 repetitions each. Designs were evaluated on a 12-dimensional rubric by three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6). We report four core findings. First, structural adversarial (v4b) ranks #1 by ensemble -- a prompt-engineered adversarial variant that demands rewrite mandates rather than patches (weighted ensemble: 4.637/5.0). Second, cross-model review wins unanimously at #2 -- generate with one model, review with another -- ranking #2 by all three evaluators (weighted ensemble: 4.606). Third, evaluator diversity is itself a finding -- all three evaluators agree v4b is best and v3 is worst, but disagree sharply on v2b (Claude d=1.44 vs. GPT-OSS d=0.45), revealing how different model families weight design qualities. Fourth, parallel merge is fundamentally broken -- all three evaluators place merge variants in the bottom tier (3.65-3.79), due to token starvation and the Frankenstein effect. The weighted ensemble ($2\times$Opus + $2\times$Sonnet + $1\times$GPT-OSS) provides robust rankings across 520 runs, confirmed through independent cross-validation.
comment: 12 pages, 9 figures, 5 tables
Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX
High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.
Genotype-Conditioned Molecular Generation via Evidence-Grounded Multi-Objective Latent Perturbation in Diffusion Models
Developing effective anticancer therapeutics remains challenging due to tumor heterogeneity and the absence of well-defined molecular targets across cancer subtypes. Generative models conditioned on cancer genotypes offer a promising avenue for personalized drug discovery, yet existing approaches lack explicit optimization for simultaneous sensitivity, synthesizability, and mechanistic binding plausibility. We present a latent-space optimization approach for a pretrained genotype-to-drug diffusion model, introducing a learnable perturbation over the molecular latent space optimized via gradient ascent to maximize a composite reward combining predicted drug sensitivity (AUC), drug-likeness (QED), and synthetic accessibility (SAS). Critically, biological realism is enforced by grounding both reward design and evaluation in experimentally-derived cancer cell line data and validated pharmacologic signals, anchoring candidate generation in real-world clinical evidence. Mechanistic consistency plausibility is further assessed by a multi-agent LLM pipeline grounded in the diffusion model's attention mechanism. Experiments across 15 cancer cell lines from three held-out evaluation sets demonstrate consistent and noticeable improvements over competing baselines in sensitivity, drug-likeness, synthesizability, and chemical validity.
SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories
Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-level feedback, which makes failure attribution coarse and often produces unstable or overly broad revisions. We propose SkillAdaptor, a training-free step-level skill adaptation framework with explicit failure attribution, and it can plug into OpenClaw-class agent harnesses. Given a failed trajectory, SkillAdaptor identifies a first actionable fault step, links responsibility to candidate skills, and applies targeted updates under explicit acceptance checks while keeping the backbone frozen. We evaluate on WebShop, PinchBench, and Claw-Eval with Kimi-K2.5, GLM-5, and GPT-5.2. SkillAdaptor improves over no-skill and skill-adaptation baselines on all three suites, with the largest single-metric improvements of +1.5 points on PinchBench Avg Score%, +1.8 on Claw-Eval Avg Score, and +1.7 on WebShop success rate. These results indicate that step-level attribution supports more stable and auditable training-free skill maintenance\footnote{The code will be released at https://github.com/zjunlp/SkillAdaptor.}.
comment: Work in progress
Coordinating Task Switching in a Robotics Multi-Agent System Using Behavior Trees
The application of multi-agent systems in robotics is a very challenging field. Several competitions involving such systems are proposed to foster research and development of strategies and mechanisms using games as the underlying domain. Among them are the ones from the \textit{IEEE Very Small Soccer (VSSS)} category, which is the case study described in this paper. In VSSS, two teams of three robots each compete in a very dynamic environment of a soccer game. Thus, coordination of robots' behavior during the game is crucial to win it. In this paper, we present a Behavior-Tree-based approach to support multi-robot coordination within the VSSS team of the ThundeRatz robotics team from the Universidade de S$\tilde{a}$o Paulo. Moreover, a comparison between the proposed approach and the previous one, which was based on a Finite State Machine (FSM), was conducted using the FIRASim simulator. Besides that, the performance of this new strategy was further evaluated in an academic robotics competition.
comment: 7 pages, 7 figures. Preprint of a manuscript submitted to the XXVI Congresso Brasileiro de Automática (CBA 2026)
When Parallelism Pays Off: Cohesion-Aware Task Partitioning for Multi-Agent Coding
Multi-agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter-agent communication overhead, which incurs extra cost and can sometimes offset the efficiency gains. We formalize multi-agent orchestration as a graph partitioning problem that captures the communication-to-computation trade-off: task decomposition can shorten critical-path computation, but cross-agent dependencies require costly context transfer. We instantiate this view in repository-level software engineering and present Cohesion-aware Coder (Co-Coder), which builds dependency graphs from static analysis, isolates structural hub files, partitions the graph via community detection, and executes the partition with a dependency-aware scheduler. Across 28 real-world tasks on DevEval and CodeProjectEval, Co-Coder advances the Pareto-frontier over sequential and file-based parallel baselines as well as Claude Code with Agent Teams, lifting pass rate by up to 14.0%, achieving up to a 2.10x wall-clock speedup, and reducing API cost by up to 35%, with the largest gains on the most dependency-dense projects. Co-coder demonstrates how cohesion-aware orchestration can make parallel coding agents both theoretically grounded and practically efficient, suggesting a broader design principle for multi-agent systems.
FinCom: A Financial Multi-Agent Demo with Disagree-or-Commit Deliberation
Multi-agent systems powered by large language models (LLMs) are increasingly used for financial analysis and decision support. However, existing coordination schemes, especially those emphasizing consensus or debate, are vulnerable to sycophancy: agents conform to peer reasoning instead of evidence, leading to premature agreement and degraded outcomes. We introduce FinCom (Financial Committee), a governed multi-agent framework and interactive system that operationalizes the Disagree-or-Commit (DoC) protocol to embed structured dissent into financial AI committees. A central Supervisor orchestrates three ReAct-enabled specialist agents: Research, Quantitative, and Risk. Each agent is equipped with role-specific tools for retrieval, computation, and stress testing. During deliberation, agents must either explicitly critique or commit to their peers' reasoning before converging on a unified recommendation. This demonstration showcases how FinCom supports committee-style financial analysis through coordinated multi-agent interaction, including structured report generation and interactive decision support. Evaluated across the most recent financial agent benchmark, in addition to 90 internal handcrafted financial tasks using an LLM-as-a-Judge protocol, DoC improves reasoning accuracy and risk awareness significantly over a consensus-seeking baseline on both an in-house and external evaluation set. By reframing disagreement as a governance primitive rather than noise, FinCom offers a lightweight, prompt-only recipe for improving accountability, transparency, and epistemic robustness in agentic financial systems.
The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size
Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two-parameter scaling law $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-β})$ where the regime exponent $β$ classifies any configuration into one of three asymptotic regimes -- hard-ceiling at $1/c$ ($β= 0$), sublinear at $N^β/c$ ($0 < β< 1$), or linear ($β\ge 1$), and a mean-field theorem predicts that peer count $k$ and rounds $τ$ during agent debate enter the dynamics only through their product $kτ$. The law applies at two levels: answer diversity and correctness redundancy. Across 44 (model $\times$ task $\times$ condition) cells spanning peer debate, self-correction, random-noise placebo, self-consistency, three open-weight families (Qwen, Llama, Ministral) at scales from 7B to 32B with a frontier API check (Gemini), thinking models, heterogeneous teams, and sparse communication, the functional form fits every condition at $R^2 > 0.99$; only $(c, β)$ shifts. On free-form math, dense peer influence collapses the answer-level regime from sublinear into hard-ceiling; correctness-level fits remain hard-ceiling throughout. Three findings have practical implications. \emph{(i)}~Thirty dense debating agents produce no more answer diversity than one on MMLU-Hard. \emph{(ii)}~A noise placebo tracks self-correction on free-form math and at $4\times$ scale, so within homogeneous teams the gain commonly attributed to ``debate'' comes from re-evaluation, not peer content. \emph{(iii)}~A single $N \le 5$ pilot predicts the $N=30$ structural ceiling, and within the configurations tested only architectural diversity (heterogeneous teams) lowers $c$ and escapes the hard-ceiling regime, communication-mode interventions do not.
comment: 41 pages, 9 figures, 20 tables
CrazyMARL: Decentralized Direct Motor Control Policies for Cooperative Aerial Transport of Cable-Suspended Payloads ICRA
Collaborative transportation of cable-suspended payloads by teams of UAVs has the potential to enhance payload capacity, adapt to different payload shapes, and provide built-in compliance, making it attractive for applications ranging from disaster relief to precision logistics. However, multi-UAV coordination under disturbances, nonlinear payload dynamics, and slack-taut cable modes remains a challenging control problem. To our knowledge, no prior work has addressed these cable mode transitions in the multi-UAV context, instead relying on simplifying rigid-link assumptions. We propose CrazyMARL, a decentralized RL framework for multi-UAV cable-suspended payload transport. Simulation results demonstrate that the learned policies can outperform classical decentralized controllers in terms of disturbance rejection and tracking precision, achieving an 80% recovery rate from harsh conditions compared to 44% for the baseline method. We also achieve successful zero-shot sim-to-real transfer and demonstrate that our policies are highly robust under harsh conditions, including wind, random external disturbances, and transitions between slack and taut cable dynamics. This work paves the way for autonomous, resilient UAV teams capable of executing complex payload missions in unstructured environments. Code and videos can be found on the website: https://imrclab.github.io/CrazyMARL.
comment: International Conference on Robotics and Automation (ICRA), 2026
Ev-Trust: An Evolutionarily Stable Trust Mechanism for Decentralized LLM-Based Multi-Agent Service Economies
Decentralized LLM-based multi-agent service economies face three vulnerabilities that undermine traditional trust mechanisms: reduced cost of fraud, difficulty in evaluating service quality, and instability of service content. These compounding vulnerabilities can trigger population-level trust collapse and the proliferation of short-sighted strategies. We propose Ev-Trust, an evolutionarily stable trust mechanism that addresses these vulnerabilities through three targeted designs: a cross-validation gate leveraging requestor semantic comprehension to assess response validity, a variance-standardized drift measure filtering endogenous stochasticity from genuine behavioral anomalies, and an embedding of trust signals into the expected revenue function that converts trustworthiness into an evolutionary survival advantage. Based on replicator dynamics with a noisy best response micro-foundation, we prove the asymptotic stability of cooperative evolutionarily stable strategies and derive explicit threshold conditions for maintaining cooperative equilibria. We evaluate Ev-Trust through 100-round simulations with at least 100 heterogeneous LLM-driven agents covering seven behavioral types. The experiments are conducted on TruthfulQA and TriviaQA, two factual question-answering benchmarks. Compared to baselines based on transitive trust aggregation, reinforcement-learning reputation, and pure evolutionary imitation, Ev-Trust reduces malicious agent participation by approximately 60%, suppresses the fraudulent service rate by approximately 50%, and maintains stable trust differentiation under a 30% adversarial mutation. These results demonstrate that coupling semantic trust evaluation with evolutionary incentives provides a principled foundation for securing cooperation in decentralized LLM-based multi-agent systems.
comment: 19 pages, 9 figures
Scalable Ride-Sourcing Vehicle Rebalancing with Service Accessibility Guarantee: A Constrained Mean-Field Reinforcement Learning Approach
The expansion of ride-sourcing services such as Uber and Lyft has reshaped urban transportation by offering flexible, on-demand mobility via mobile applications. Despite convenience, these platforms confront significant operational challenges, particularly vehicle rebalancing-strategic repositioning of a fleet of vehicles to address spatiotemporal mismatches in supply and demand. Inadequate rebalancing results in prolonged rider waiting times and inefficient vehicle utilization, but also leads to fairness issues, such as the inequitable distribution of service and disparities in driver income. To tackle these, we introduce continuous-state mean-field control (MFC) and mean-field reinforcement learning (MFRL) models with continuous repositioning actions. MFC and MFRL offer scalable solutions by modeling each vehicle's behavior through interaction with the vehicle distribution, rather than with individual vehicles. This mitigates the curse of dimensionality with respect to the number of agents, enabling coordination across large fleets with significantly reduced computational complexity and eliminating the need to retrain the model when fleet size changes. To ensure equitable service access across geographic regions, we integrate an accessibility constraint into models and derive rebalancing policies that strike a balance between high fulfillment of rider demand and fair coverage of vehicle supply. Extensive evaluation using data-driven simulation of Shenzhen demonstrates the efficiency and robustness of our approach. Remarkably, it scales to tens of thousands of vehicles, with training times comparable to linear programming rebalancing. Besides, our policies effectively explore the efficiency-equity Pareto front, outperforming conventional benchmarks across key metrics like fleet utilization, fulfilled requests, and pickup distance, while ensuring equitable service access.
comment: 34 pages, 15 figures
Systems and Control (EESS)
Global Convergence of a Line-Search Filter Differential Dynamic Programming Method
In this article, we establish the global convergence properties of the FilterDDP algorithm, which extends the discrete-time differential dynamic programming (DDP) algorithm of Mayne and Jacobson [\emph{International Journal of Control}, 3, (1966), pp. 85-95] to handle nonlinear constraints over states and controls, in addition to the dynamics. FilterDDP adopts a line-search filter procedure for step acceptance. However, instead of a damped Newton step applied in the general nonlinear programming setting, the computation of a trial point involves applying a backward recursion and a forward simulation. We establish the global convergence of FilterDDP by showing that for a subset of constrained optimal control problems, the this backward-forward procedure satisfies the same properties as a Newton step for the purpose of establishing global convergence of a line-search filter method, following the analysis of Wächter and Biegler [\emph{SIAM Journal on Optimization}, 16 (2005), pp. 1-31].
Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX
High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.
Hosting Capacity Assessment and Enhancement for Edge Data Centers in Active Distribution Networks
With the increasing demand for edge computing and AI-driven workloads, integrating small and medium-sized edge data centers into distribution networks has become increasingly important. This paper investigates the hosting capacity of distribution networks for data center integration and identifies the key physical mechanisms that limit the maximum allowable data center load. The baseline analysis shows that data center hosting capacity varies significantly across candidate buses due to network topology and electrical distance. Three dominant limiting mechanisms are identified: current-constrained locations, voltage-constrained locations, and mixed-constrained locations where both current loading and voltage deviation jointly affect hosting capacity. To increase the hosting capacity, this study evaluates multiple flexible resources, including battery energy storage systems (BESS), dispatchable distributed generators (DDG), and static synchronous compensators (STATCOM). Numerical results demonstrate that these resources provide complementary benefits through active power support, sustained local generation, and reactive power compensation, effectively expanding data center hosting capacity in distribution systems.
Autopilot-Preserving Residual Q-Learning with HJB-Inspired Finite-Action Risk Filtering for Fixed-Wing UAV Command Supervision
A fixed-wing UAV must hold airspeed, altitude, and heading references under wind, gusts, and turbulence, channels coupled so that correcting one can degrade another. Classical autopilots stabilize the airframe well but adapt poorly when a hard crosswind meets an aggressive turn, while reinforcement-learning (RL) policies acting directly on the surfaces concentrate exploration risk at the actuator interface. We place a learned supervisor above an unchanged autopilot rather than inside it: it selects a residual from a finite, bounded action set on the commanded airspeed, altitude, and heading; the modified reference is projected into an admissible command envelope before reaching the autopilot, which stays the only actuator-facing controller. What is new is how the residual is chosen. HJB residual scores candidates with a semi-discrete value-iteration critic in the spirit of the Hamilton-Jacobi-Bellman (HJB) equation, ranks them by a no-op-relative Hamiltonian advantage, and filters them through a control-Lyapunov- and control-barrier-inspired finite-action shield that always keeps a no-op fallback. On a shared 12-state runtime holding the plant, autopilot, and actuator model fixed, so the comparison is at the package level, HJB residual lowers mean RMS path-tracking error to 44.809 m, against 338.617 m for the baseline autopilot and 88.809 m for a tabular-Q residual, an 86.77% reduction over the baseline and 49.54% over Q-learning. The gain concentrates where the baseline fails worst and comes with a measured rise in airspeed error, so no method dominates every metric. We present this autopilot-preserving residual command-supervision design and benchmark with its trade-offs reported intact.
comment: 47 pages, 12 figures, 20 tables. Simulation-based study with a code-traceable benchmark, source code and a demonstration video are linked in the paper
A Koopman Set-Membership Approach for Nonlinear Data-Driven Control with Stability Guarantees
This paper proposes a data-driven controller design method for unknown nonlinear systems based on a Koopman bilinear realization. Using Koopman operator theory, the nonlinear system can be represented as a bilinear discrete-time system with a residual error term. The residual error is proportionally bounded by the norm of the lifted state and input, while the system matrices of the bilinear model are unknown. Assuming that bounds on the residual error are available, the unknown system matrices are characterized via a set-membership representation using the collected input-state data pairs of the nonlinear system. A data-driven controller design method is proposed to ensure stability for all bilinear systems within this set-membership description and for all admissible residual errors. More specifically, we design a rational state-feedback controller that stabilizes the bilinear model with residual error and, consequently, the original nonlinear system, by solving a sum-of-squares (SOS) program. The effectiveness of the proposed approach is demonstrated through numerical examples.
All Models are Wrong, Knowing Where is Useful: On Model Uncertainty in Reinforcement Learning
Model-based reinforcement learning (MBRL) infers information about the environment from a learned dynamics model and bears the potential to address open problems such as data efficient and safe learning in robotics. However, inaccuracies of the learned dynamics model are typically exploited by the agent, substantially hampering the capabilities of MBRL methods. We present a framework for dealing with inaccuracies of probabilistic models through targeted handling of uncertainty that effectively mitigates model exploitation. We present recent successes in learning directly on hardware and safe exploration, and discuss future directions for uncertainty-aware MBRL.
Toward Reliable Semantic Communication: Beyond Average Performance
Semantic communication has emerged as a promising paradigm for improving transmission efficiency by conveying task-relevant semantics rather than raw data. Although recent studies have achieved notable gains in communication efficiency and average task performance, reliability remains a fundamental bottleneck in dynamic and uncertain environments. In particular, most existing designs are still optimized mainly for average-case behavior, while lower-tail performance under adverse transmission conditions remains insufficiently understood and inadequately protected. In this article, we present a unified perspective on reliable semantic communication beyond average performance. We first review three reliability-oriented design categories: channel-aware adaptation, robustness-oriented codec design, and hybrid automatic repeat request (HARQ)-based retransmission. We show that these approaches address reliability from complementary perspectives, but each still has inherent limitations. Motivated by these observations, we discuss two solution directions: robust adaptive semantic communication under imperfect CSI, and joint source-channel-check coding with adaptive retransmission for sample-level reliability enhancement. Finally, we outline several future research directions, including the joint design of robustness and retransmission, reliability metrics beyond averages, and compatibility with existing digital wireless networks.
comment: 7 pages, 4 figures
DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance
Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, specifically sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that synergizes the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to instantaneously prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Due to the severe physical risks associated with live testing of these sudden crossing scenarios, the framework is rigorously evaluated offline using a custom multi-modal dataset collected across both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the codes to our GitHub repo at https://github.com/oskarnatan/DeepIPCv3.
Regulating EV Charging Markets for Fairness: Incentives for Pricing and Capacity Decisions
The transition to electric mobility calls for charging infrastructure that is both efficient and socially equitable. This paper examines fairness in electric vehicle (EV) charging station pricing and capacity through a game-theoretic perspective. We model a non-cooperative market in which competing charging service providers set prices and capacities while customers choose stations based on generalized cost, leading to a market equilibrium. We then benchmark this decentralized outcome against an idealized planner solution that jointly optimizes efficiency and equity. To align market outcomes with socially desirable goals, we design targeted incentives that guide operators toward more fair charger placement. Case studies demonstrate that unregulated competition tends to exacerbate disparities in charger access across demographic groups, whereas carefully calibrated incentives can reduce inequities without significant efficiency loss. The framework provides insights for policymakers on reconciling free-market dynamics with the broader societal goals of fairness in electrified mobility systems.
Efficient Numerical Modeling of Near-Field Diffraction in ORIS-Assisted Free-Space Optical Links
This paper investigates near-field propagation in optical reconfigurable intelligent surface (ORIS)-assisted free-space optical (FSO) communication systems. Unlike conventional far-field scenarios, near-field propagation involves complex diffraction effects that hinder tractable closed-form analysis. To address this issue, a numerical framework for evaluating the optical field distribution of ORIS-assisted FSO links is proposed. Specifically, two numerical approaches are considered: direct Riemann-sum evaluation and a fast Fourier transform (FFT)-based method. Although the Riemann sum approach provides accurate field estimation, it incurs extremely high computational complexity due to the fine spatial discretization of the ORIS surface required at optical wavelengths. To improve computational efficiency, the optical-field calculation is reformulated as a convolution in the spatial-frequency domain, enabling efficient FFT-based propagation analysis. Simulation results demonstrate that the proposed FFT-based method achieves accuracy comparable to that of the Riemann-sum approach while significantly reducing computational complexity.
Data-Driven Min-Max MPC with Integral Quadratic Constraints
Data-driven control of nonlinear systems with rigorous guarantees is a challenging control problem. Integral quadratic constraints (IQCs) provide a powerful framework for modeling nonlinearities. This paper presents a data-driven min-max model predictive control (MPC) synthesis method for unknown systems subject to (nonlinear) uncertainties using the IQC framework. The unknown system matrices are characterized by a set-membership representation using the input-state data and the knowledge of the IQCs. We derive two semidefinite programs (SDPs) that minimize an upper bound on the worst-case cost over all possible system dynamics and uncertainties. By iteratively solving these SDPs, the proposed state-feedback control law is obtained. We further prove that the resulting closed-loop system is exponentially stable and satisfies the input and state constraints. A numerical example demonstrates the validity of the proposed method.
AI-IoT-Robotics Integration: Survey of Frameworks, Emerging Trends, and the Path Toward Connected Robotics
The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming the foundation of real-time, intelligent, and context-aware systems. AI enables perception and reasoning, IoT provides scalable sensing and communication, and robotics delivers embodied actuation. Despite significant progress in pairwise combinations such as AIoT and the Internet of Robotic Things (IoRT), there remains a lack of unified design frameworks that fully integrate all three. This survey synthesizes the state-of-the-art across these domains, emphasizing the emerging role of Small Language Models (SLMs) at the edge and Large Language Models (LLMs) in the cloud for distributed cognition and autonomous decision-making. We propose a modular system architecture that aligns with these trends, analyze persistent gaps in interoperability and feedback control, and classify existing work by integration depth. Our review highlights how hybrid SLM-LLM systems, when coupled with IoT infrastructure and robotic agents, can address challenges in real-time adaptation, scalability, and reliability. This work offers a conceptual and technical roadmap for designing next-generation AI-IoT-Robotic ecosystems that are modular, interpretable, and capable of learning within dynamic environments, paving the way for the emerging paradigm of Connected Robotics and Physical AI.
comment: 15 pages, 3 figures, 3 tables. Published in IEEE Internet of Things Journal
Power Grid Infrastructure for AI Data Centers
This article addresses recent advances in artificial intelligence, which have set off an astounding race among technology frontiers to build large data centers. It provides insights into impacts of large data centers on the planning and operation of the power grid.
Learning in Stackelberg Markov Games
Designing socially optimal policies in multi-agent environments is a fundamental challenge in both economics and artificial intelligence. This paper studies a general framework for learning Stackelberg equilibria in dynamic and uncertain environments, where a single leader interacts with a population of adaptive followers. Motivated by pressing real-world challenges such as equitable electricity tariff design for consumers with distributed energy resources (such as rooftop solar and energy storage), we formalize a class of Stackelberg Markov games and establish the existence and uniqueness of stationary Stackelberg equilibria under mild continuity and monotonicity conditions. We then extend the framework to incorporate a continuum of agents via mean-field approximation, yielding a tractable Stackelberg-Mean Field Equilibrium (S-MFE) formulation. To address the computational intractability of exact best-response dynamics, we introduce a softmax-based approximation and rigorously bound its error relative to the true Stackelberg equilibrium. Our approach enables scalable and stable learning through policy iteration without requiring full knowledge of follower objectives. We validate the framework on an energy market simulation, where a public utility or a state utility commission sets time-varying rates for a heterogeneous population of prosumers. Our results demonstrate that learned policies can simultaneously achieve economic efficiency, equity across income groups, and stability in energy systems. This work demonstrates how game-theoretic learning frameworks can support data-driven policy design in large-scale strategic environments, with applications to real-world systems like energy markets.
Line-Search Filter Differential Dynamic Programming for Optimal Control with Nonlinear Equality Constraints ICRA
We present FilterDDP, a differential dynamic programming algorithm for solving discrete-time, optimal control problems (OCPs) with nonlinear equality constraints. Unlike prior methods based on merit functions or the augmented Lagrangian class of algorithms, FilterDDP uses a step filter in conjunction with a line search to handle equality constraints. We identify two important design choices for the step filter criteria which lead to robust numerical performance: 1) we use the Lagrangian instead of the cost in the step acceptance criterion and, 2) in the backward pass, we perturb the value function Hessian. Both choices are rigorously justified, for 2) in particular by a formal proof of local quadratic convergence. In addition to providing a primal-dual interior point extension for handling OCPs with both equality and inequality constraints, we validate FilterDDP on three contact implicit trajectory optimisation problems which arise in robotics.
comment: Accepted for publication in the IEEE International Conference on Robotics and Automation (ICRA) 2026. Revised version with more exposition in methodology and updated results with improved implementation
Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning
Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptimal coordination, particularly when sparse task rewards provide insufficient grounding for coordinated behavior. This study introduces an autonomous reward design framework that uses large language models (LLMs) to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and trains policies from scratch using Multi-Agent Proximal Policy Optimization (MAPPO) under a fixed computational budget. The candidates are then evaluated on the basis of their performance, and selection across generations solely based on the sparse task returns. The framework is evaluated in four Overcooked-AI layouts characterized by varying levels of corridor congestion, handoff dependencies, and structural asymmetries. The proposed reward design approach consistently yields higher task returns and delivery counts, with the most pronounced gains observed in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components reveals stronger interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the proposed LLM-guided reward search framework mitigates the need for manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.
Data-Enabled Predictive Control with Predictive Adaptive Line-of-Sight Guidance for 3-D Path Following of Autonomous Underwater Vehicles
This paper presents a fully data-driven 3-D path-following framework for autonomous underwater vehicles (AUVs), a representative class of underwater field robotics, based on Data-Enabled Predictive Control (DeePC). The approach eliminates explicit hydrodynamic modeling by exploiting measured input-output trajectories to predict and optimize future system behavior. Classic DeePC is employed for heading control, while a cascaded DeePC architecture with loop-frequency separation is proposed for depth regulation, extending DeePC to plants whose dominant output evolves significantly slower than the actuator bandwidth. For 3-D waypoint path following, the Adaptive Line-of-Sight (ALOS) guidance law is extended to a predictive multistep formulation (PALOS) that supplies the horizon-consistent reference required by receding-horizon predictive controllers. All methods are validated in high-fidelity 6 degrees of freedom simulation on the REMUS~100 AUV under nominal operation, ocean-current disturbances, operation beyond the data regime, and 3-D waypoint path following, consistently outperforming the corresponding state-of-the-art benchmarks. In 3-D waypoint path following, the framework reduces cross-track error by approximately 28\% relative to the ALOS-PI/PID baseline.
comment: 16 pages, 6 figures
Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation
We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in real-world environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use a lightweight model as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead deriving global heading via differential analysis of sequential GNSS coordinates. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally-aware systems. To support future research, we will release the codes to our GitHub repo at https://github.com/oskarnatan/Seq-DeepIPC.
comment: This work has been accepted for publication in the IEEE Sensors Journal. https://ieeexplore.ieee.org/document/11373257/
Sensitivity increase of 3D printed, self-sensing, carbon fibers structures with conductive filament matrix due to flexural loading
The excellent structural and piezoresistive properties of continuous carbon fiber make it suitable for both structural and sensing applications. This work studies the use of 3D printed, continuous carbon fiber reinforced beams as self-sensing structures. It is demonstrated how the sensitivity of these carbon fiber strain gauges can be increased irreversibly by means of a pretreatment by pre-stressing the sensors with a large compressive bending load. The increase in the gauge factor is attributed to local progressive fiber failure, due to the combination of the thermal residual stress from the printing process and external loading. The coextrusion of conductive filament around the carbon fibers is demonstrated as a means of improving the reliability, noise and electrical connection of the sensors. A micrograph of the sensor cross section shows that the conductive filament contacts the various carbon fiber bundles. All-in-all, the use of pre-stressing carbon fiber strain gauges in combination with coextrusion of conductive filament hold promises for 3D printed structural sensors with a high sensitivity.
comment: 28 pages, 9 figures, 4 tables
Robotics
Generative Multi-Robot Motion Planning via Diffusion Modeling with Multi-Agent Reinforcement Learning Guidance
Coordinating multiple robots in shared environments requires generating feasible trajectories for each agent while accounting for interactions among agents. Centralized planning approaches become difficult to scale as the number of robots increases, while decentralized approaches that allow each agent to plan independently do not inherently account for inter-agent interactions. This paper presents a framework for coordinated multi-robot motion planning that combines decentralized generative trajectory planning with multi-agent reinforcement learning (MARL)-based coordination. Each robot independently generates candidate trajectories using a diffusion model trained on single-agent motion data, leveraging the generative model's ability to produce feasible and diverse trajectories. To reduce conflicts between agents, a centralized value function trained via MARL guides the reverse diffusion process through gradient-based steering, enabling interaction-aware trajectory generation without centralized joint planning or retraining of the generative model. This guidance follows an exponential tilting formulation, in which the value function biases the denoising distribution toward trajectories with higher expected multi-agent return. The framework is evaluated in a simulated maze environment with four mobile robots. Experimental results show that the proposed value-guided diffusion planning reduces the inter-agent interference rate from 55.4% to 41.8%, demonstrating that coordination can be effectively achieved while preserving the scalability of decentralized trajectory generation. These results suggest that MARL-based value guidance can effectively introduce coordination into decentralized generative planners without requiring a fully joint multi-robot model.
comment: 11 pages, 6 figures, 1 table. This paper has been accepted for publication in the proceedings of ASME IDETC-CIE 2026
A Machine-to-Machine Knowledge-Guided LLM Agent for Generalizable Radiotherapy Treatment Planning
In this work, we propose a prototype machine-to-machine (M2M) knowledge-guided Large Language Model (LLM) framework for automated radiotherapy treatment planning. In the proposed paradigm, Treatment Planning Parameter (TPP) distribution knowledge discovered by a Deep Reinforcement Learning (DRL) agent is transferred to an LLM agent through in-context learning, enabling autonomous iterative planning without human intervention. While standard LLM-based planning often lacks physical intuition and struggles with convergence, the integration of DRL-derived guidance constrains the agent to a physically valid parameter space. Experimental evaluations are performed across three diverse planning scenarios: basic prostate cases, complex prostate configurations with increased organ-at-risk (OAR) constraints, and liver cases. The evaluation results demonstrate that the guided LLM agent consistently achieves optimal planning scores while significantly reducing the number of iterations compared to unguided planning. Analysis of the final TPP configurations reveals that the agent successfully learns a hierarchical priority of objectives, effectively restoring a logical "cause-and-effect" relationship between parameter tuning and dosimetric outcomes. Crucially, this prototype framework exhibits robust generalizability, maintaining high planning quality regardless of specific patient anatomy, treatment site, or initial plan quality. By bridging the specialized optimization of DRL with the adaptive reasoning of LLMs, this M2M framework establishes a scalable foundation towards generalizable autonomous treatment planning, ultimately benefiting clinical practice in realistic environments.
comment: 10 pages, 6 figures
GABI: Geometry-Aware Boundary Integration for Spacecraft Segmentation CVPR 2026
Accurate segmentation is crucial for autonomous spacecraft, as it directly affects downstream tasks related to 3D situational awareness. The harsh illumination conditions of space, however, produce images with high variability in appearance, hindering the generalization of segmentation approaches across different spacecraft and environments. In this work, we propose GABI, a lightweight boundary-aware multi-task segmentation architecture that augments a convolutional backbone with an auxiliary distance-field prediction head. The distance field provides dense geometric supervision around object boundaries, encouraging the network to learn spatially consistent representations of spacecraft structures while maintaining low model complexity suitable for onboard perception systems. We evaluated GABI against both an established convolutional baseline and a heavier transformer-based architecture. On the SPARK benchmark, distance-field supervision improves the baseline by up to $5\%$ in Average Precision while achieving performance comparable to the transformer models. In generalization experiments, GABI improves Average Precision by more than $50\%$ over the baseline. In cross-domain evaluation, the lightweight GABI variant performs within $5\%$ in IoU and F1-score of the heavier transformer model while being approximately ten times smaller. At the same time, the heavier GABI variant surpasses the transformer architectures while remaining nearly three times lighter.
comment: Accepted to AI4Space at CVPR 2026
From Cues to Horizons: Dynamic Risk Horizon Profiling for Trajectory Prediction
Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk into trajectory prediction to quantify dangers posed by surrounding agents. However, most risk-aware approaches use past risk information as a secondary signal to help guide decisions, overlooking its future evolution and uncertainty. In this paper, we propose a risk horizon profiling (RHP) module that incorporates a continuous, learnable potential field model for risk-aware trajectory prediction. The RHP module calculates the spatial-temporal proximity of surrounding objects to profile risk distributions across future horizons, which supports better trajectory prediction by adaptively identifying what human drivers perceive as critical moments. We evaluate our method on two datasets from different driving settings, highD for highway corridors and SHRP2 for urban streets, which cover diverse risk scenarios including safe, near-crash, and crash events. Compared to the baseline methods, our framework achieves a 25.0\% reduction in 5s RMSE on the highD dataset and a 29.1\% reduction in 5s minFDE on SHRP2. These results indicate strong performance for both short and long horizon prediction and robust generalization across highway and urban scenarios. The proposed method enables more realistic AV path planning and strategic selection, thereby supporting safer autonomous driving and more advanced driver-assistance systems. The source code for this work is available at: https://github.com/bilab-nyu/RHP
comment: 11 pages, 7 figures, submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)
Coarse-to-Fine Compositional Diffusion for Long-Horizon Planning
Diffusion models provide strong priors for generating structured data, but many tasks require outputs beyond the scale on which these models are typically trained. Compositional generation addresses this by composing overlapping local plans from a pretrained short-horizon prior into a long-horizon output. However, standard composition primarily enforces agreement between neighboring local plans, yielding local consistency without directly specifying the global structure of the full composition. As a result, locally compatible plans may still form an implausible route, task sequence, or temporal evolution. Existing methods improve global coherence by repeatedly propagating local consistency signals or by adding inference-time optimization, but these procedures become expensive as the number or dimensionality of local plans increases. We propose Coarse-to-Fine Compositional Diffusion (CoFi), an inference-time sampler that separates global structure formation from local detail refinement. CoFi first aligns local denoised estimates around a shared coarse structure, producing a global scaffold that captures the long-range task-level arrangement. It then diffuses this scaffold to an intermediate noise level and denoises it with the same pretrained local prior, restoring local fine structure while preserving the scaffold-induced global coherence. Across long-horizon robotic planning, panoramic image generation, and long video generation, CoFi not only improves both global coherence and local sample quality over prior compositional baselines, but also requires 2-8x fewer denoiser evaluations.
comment: Project page: https://cofi-diffusion.github.io
SafeVLA-Bench: A Benchmark for the Success-Safety Gap in Vision-Language-Action Models
Vision-language-action (VLA) benchmarks measure whether a policy completes a requested manipulation task, but binary success can hide safety-relevant trajectory behavior: reaching the goal while applying excessive contact, disturbing bystander objects, destabilizing the held object, or entering robot self-contact. We present SafeVLA-Bench, a post-hoc safety-evaluation framework for existing simulator-based VLA benchmarks. It formalizes task-aware safety requirements as Signal Temporal Logic (STL) specifications and reports native success with two unsafe-success metrics: Succ-But-Unsafe (SBU), the fraction of rollouts that both succeed and violate safety, and Violation Severity Index (VSI), a bounded worst-violation depth score. We instantiate SafeVLA-Bench on LIBERO and RoboCasa-365, evaluating nine policy-benchmark entries across tabletop and kitchen manipulation tasks. High task success does not imply safe execution: high-SR tabletop baselines still leave 13 to 15 percent unsafe-episode rates,and 36 to 56 percent of successful RoboCasa-365 rollouts violate at least one active safety clause. Project page: https://safevla.org.
comment: 27 pages, 5 figures
STEM: Semantic Target Search and Exploration using MAVs in Cluttered Environments
Autonomous target search is crucial for deploying Micro Aerial Vehicles (MAVs) in emergency response and rescue missions. Existing approaches either focus on 2D semantic navigation in structured environments -- which is less effective in complex 3D settings, or on robotic exploration in cluttered spaces -- which often lacks the semantic reasoning needed for efficient target search. This paper overcomes these limitations by proposing a novel framework that utilizes a semantically-guided viewpoint planner to minimize target search and exploration time in unstructured 3D environments using an MAV. Specifically, we develop a combinatorial planner that generates efficient semantic exploration plans by prioritizing viewpoints that likely lead to the target. To guide the planner towards the target, an active perception pipeline is developed that propagates semantic priorities of observed objects into neighboring frontier voxels for computing semantic information gains of frontier viewpoints. In addition, we demonstrate how LLM-based similarity scores can be leveraged as semantic priority input to our pipeline. Evaluations in two distinct simulation environments show that the proposed method consistently outperforms baselines by quickly finding the target while maintaining reasonable exploration times. Real-world experiments with an MAV further demonstrate the method's ability to handle practical constraints like limited battery life, small sensor range, and semantic uncertainty.
comment: Accepted to Autonomous Robots Journal. Nikhil Sethi and Max Lodel contributed equally
Beyond Pure Sampling: Hybrid Optimization Mechanisms for Non-Convex Model Predictive Control
This paper investigates the optimization mechanisms of non-convex Model Predictive Control (MPC) using the Maximum Entropy Differential Dynamic Programming (ME-DDP) framework. Navigating non-convex cost landscapes induced by nonlinear dynamics, multiple obstacles, etc. remains a fundamental challenge in robotics, where gradient-based methods frequently converge to suboptimal local minima. We demonstrate a dual-step optimization mechanism designed to overcome these traps. (1) an initial phase of using DDP to exploit the gradient of the cost landscape, followed by (2) disruption of the optimization via sampling from policies characterized by the inverse Hessian of the action-value function. We provide a rigorous analysis of this sampling mechanism of three ME-DDP variants: Unimodal Gaussian ME-DDP, Multimodal Gaussian ME-DDP, and Stein Variational DDP. Furthermore, with navigation tasks of four robotic systems under cluttered environments, we conduct extensive benchmarking of three variants of the ME-DDP, against deterministic DDP, and one of the most successful sampling-based schemes, Model Predictive Path Integral (MPPI) control with three policy parameterizations and update laws that correspond to those of ME-DDPs. The results show that in low-dimensional systems where the cost landscapes are relatively simple and local information is sufficiently representative, our framework consistently outperforms MPPIs. In high-dimensional systems, MPPI can occasionally discover aggressive maneuvers that enable it to steer the systems faster than DDP-based methods, whereas our method maintains a higher, more stable success rate. Finally, we validate the practical efficacy of the framework through hardware experiments with a quadrotor navigating a dense, non-convex obstacle field, confirming the robustness of the proposed framework for real-world deployment.
comment: 28 pages, 13 figures
Infeasible optimization problems and the hierarchical augmented Lagrangian method in imitation learning
Imitation learning (IL) is an effective approach to train complex robotics policies. Recent works have introduced hard constraints into imitation-learning optimization problems to ensure safety, stability, and robustness of the learned policy. However, we argue that these constraints are sometimes infeasible, which can lead to unstable or difficult training dynamics. We study a simple remedy for such situations based on recent theoretical results on the augmented Lagrangian method in infeasible settings. We show that our approach drives the learned policy toward the solution of a closest-feasible constrained IL problem with desirable properties. The method is illustrated on a toy driving example with a total-acceleration constraint and pedestrian-safety constraints, a setting in which infeasibility can naturally arise while still allowing a safe learned policy.
BEVIO: Efficient Bird's-Eye-View based Sparse-Update Visual-Inertial Odometry for Lunar Day-Night Navigation
Visual-Inertial Odometry (VIO) provides smooth, high-rate state estimates and has been widely used for robotic navigation in both terrestrial and planetary applications. However, its performance is typically dependent on the frequency of visual updates, which is a challenge for planetary rovers operating under extreme resource constraints and low frame rates. This work investigates enabling reliable VIO with very sparse visual updates for lunar rover applications, addressing both day and night-time operations where feature associations become especially difficult under self-illumination conditions. We propose a Bird's Eye View (BEV)-based image matching scheme that remains robust to larger inter-frame motions and more reliable feature matching despite significant visual appearance changes. We extensively evaluate our proposed approach, BEVIO, through high-fidelity photorealistic lunar and real-time robotic experiments conducted using a half-scale lunar rover, in a long-term day-night deployment at Plaster City, CA, USA. The results demonstrate that our method enables reliable day and nighttime self-illuminated traverses at visual update rates as low as 0.25 Hz, underscoring its suitability for navigation on power- and compute-limited lunar rovers.
comment: Accepted at the 2026 IEEE International Conference on Robotics and Automation, Vienna
Shape Your Body: Value Gradients for Multi-Embodiment Robot Design
We propose to turn generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcement learning co-design loop for each robot, we first train an embodiment-aware policy and value function across many robot designs. After training, the frozen value function is used as a differentiable surrogate to optimize candidate embodiments through value gradients. We evaluate our approach across different robot design settings, from perturbed single robots to held-out robots across morphology classes, with single models trained on up to 50 robots and design spaces of over 1100 continuous embodiment parameters. Beyond optimizing complete embodiments, we show that value gradients can identify performance-limiting design and control parameters, enabling both the optimization and the analysis of new robot designs.
SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models
Embodied world models have emerged as a promising paradigm in robotics by predicting how robot actions affect the surrounding scene. However, the rollout inference remains computationally expensive in pixel space, as long-horizon manipulation videos typically have to be generated frame by frame. This cost cannot be easily reduced by indiscriminately dropping frames, since downstream policies rely on complete preservation of sparse task-relevant events such as approach, contact, grasp, and release. To address this challenge, we propose Sparse Keyframe Interpolation Paradigm (SKIP), an event-preserving sparse-to-dense framework that avoids dense frame-by-frame generation. SKIP first identifies task-relevant keyframes by leveraging robot-aware multimodal features. It then synthesizes only these keyframes with a sparse video diffusion model. A learned gap predictor and an action-conditioned interpolator subsequently reconstruct the missing intervals according to the robot actions. On LIBERO, SKIP generates dense rollouts $4.16\times$ faster than a dense baseline while improving visual fidelity and reducing aggregate FVD by $89.0\%$. Importantly, SKIP-generated videos are effective policy-training data. Even when they fully replace real demonstrations, $π_{0.5}$ success drops only $1.3$ pp in LIBERO simulation and $6.7$ pp on the real robot, whereas fully dense frame-by-frame generation collapses by $48$ to $58$ pp.
comment: 25 pages, 10 figures
Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion
Although reinforcement learning has significantly advanced humanoid locomotion, perceptive policies still struggle on sparse-foothold terrain and constrained environments. Success in these scenarios requires both broad terrain awareness and precise foothold selection, two perceptual roles that conventional encoders often entangle. To address this challenge, we propose Global-Local Attention Decomposition (GLAD) for terrain encoding in humanoid locomotion. Realized by a coarse-to-fine encoder over a robot-centric elevation map, GLAD explicitly separates these objectives: a global attention branch utilizes attention pooling to summarize the surrounding terrain context, while a state-conditioned local attention branch sparsifies and encodes precise foothold-relevant geometry. This explicit attention decomposition prevents the dilution of fine-grained spatial cues while reducing training overhead. Experiments demonstrate that GLAD enables reliable locomotion over challenging gaps, stepping stones, and stairs. Furthermore, the learned policy exhibits emergent terrain-responsive behaviors, autonomously following narrow paths and avoiding obstacles under simple velocity commands without explicit navigation planners. In real-world deployment on a Unitree G1 humanoid robot using onboard LiDAR, the proposed method achieves robust zero-shot sim-to-real transfer across diverse sparse-foothold and obstacle-rich domains.
Dynamic Resilient Spatio-Semantic Memory with Hybrid Localization for Mobile Manipulation
Reliable mobile manipulation in dynamic indoor environments requires a scene representation that remains geometrically consistent, semantically queryable, and computationally bounded as the environment changes. Existing systems often rely on pre-built maps, static-scene assumptions, or highly accurate camera poses, which can lead to stale or misaligned scene information when target objects are relocated or pose estimates are corrected. This paper presents DREAM, a real-robot mobile manipulation framework that integrates perception, memory, localization, navigation, and manipulation in previously unseen indoor environments without a pre-built map. DREAM constructs an online spatio-semantic voxel memory from RGB-D observations registered by a LiDAR-inertial-visual SLAM backend. It further introduces pose-graph-aware Redundancy-Aware Memory Pruning (RMP) to update historical observations after pose corrections while keeping long-horizon observation history bounded. For target localization and reacquisition, DREAM combines language-conditioned 3D retrieval, open-vocabulary image detection, and multimodal large language model based semantic verification. Real-robot experiments in four dynamic indoor laboratory scenes show that DREAM improves long-horizon task success rates from 40%-60% with DynaMem to 55%-70%, while maintaining a memory footprint of 0.37-0.63 GB and an online memory-update time of 0.43-0.53 s across scenes.
comment: Code, CAD model, and real-robot demonstrations are available at https://bjhyzj.github.io/dream-web
Edge-Based QoS-Aware Adaptive Task Placement: A Closed-Loop Control in Multi-Robot Systems
Multi-robot systems (MRS) increasingly offload compute-intensive perception tasks to edge nodes to meet strict time-sensitive Quality-of-Service (QoS) constraints. However, static task orchestration on a shared edge node can severely degrade QoS due to network latency, jitter, and edge-resource contention. We present a pilot edge-centric MRS testbed using Raspberry Pi nodes to evaluate a camera-to-manipulator pipeline under three modes: local execution, static offloading, and a QoS-aware Adaptive Task Placement (ATP) controller. ATP scores candidate placements using a multi-metric cost (normalized latency, CPU utilization, and switching overhead) over two-second control windows. The closed-loop visual servoing testbed is instrumented with sub-millisecond clock synchronization, network emulation, and detailed monitoring of multiple metrics across nodes to capture realistic jitter. Experimental results under compute-stress and network-fault scenarios show that static edge offloading reduces on-board CPU load but amplifies tail latency and deadline misses. In contrast, the QoS-aware ATP controller, by switching task placement based on measured latency and utilization thresholds, consistently lowers deadline violations and tail latency. Overall, the results position ATP as a practical edge-side control primitive for MRS and concrete design guidelines for Cloud-Edge Robotics deployments within the broader cloud-fog automation, while motivating QoS-aware multi-objective workload orchestration for industrial cyber-physical systems.
comment: 6 pages, 2 figure, 1 algorithm, accepted as a regular paper on the 24th IEEE International Conference on Industrial Informatics (INDIN), 26-29 July, 2026, Melbourne, Australia
A Four-Tier Communication Architecture and Sim-to-Real Validation of a Graphical Open-Source Platform for Robotic Engineering Education
The persistent challenge in scaling authentic manipulator education within university laboratories is a structural dichotomy: commercial digital twins are often cost-prohibitive and rigidly scripted, whereas open-source robotics middleware (ROS) imposes steep technical and syntax barriers for novices. To resolve this logistical and educational friction, this Work-in-Progress (WiP) paper proposes a scalable four-tier communication architecture tailored for sustainable robotic curricula. Rather than focusing on software application design, our study examines the underlying data exchange mechanisms required to bridge visual conceptual environments with physical robotic endpoints, utilizing the Graphical Open-Source Platform (GOSP) as a foundational instantiation. This WiP details the framework's technical integration of 3D visual armature modeling with a robust ROS middleware backend, emphasizing the serialization, routing, and encapsulation of intricate communication routines. Preliminary sim-to-real validation using multi-axis spatial trajectories confirms that encapsulating these communication pipelines provides a sufficient fidelity hardware-agnostic pathway. By bridging virtual design and physical execution, this architectural blueprint offers a viable infrastructure for engineering education.
comment: 4 pages, 4 figures, accepted as a Work-in-Progress (WiP) paper, on the 24th IEEE International Conference on Industrial Informatics (INDIN), 26-29 July, 2026, Melbourne, Australia
PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking
Recent vision-language-action and diffusion-based robot policies often use action chunking, where each policy query predicts a sequence of future actions and the robot executes an open-loop prefix before re-querying. While this interface improves local motion continuity, deployment still requires choosing the execution horizon: how much of each predicted chunk should be executed before acquiring a new observation. However, our experiments show that success is strongly task-dependent and non-monotonic with respect to the execution horizon, making a single constant horizon an unreliable deployment rule. We propose PACE (Phase-Aware Chunk Execution), a training-free test-time execution method that selects the execution horizon online from the predicted chunk itself. PACE exploits the phase-dependent kinematic structure of manipulation trajectories by identifying low-speed transition points in the predicted speed profile and using them as candidate replanning boundaries. Because PACE uses only the predicted action chunk, it is plug-and-play and requires no retraining or access to policy internals. We validate PACE through large-scale evaluations in both simulation and real-robot settings. On 50 RoboTwin2.0 tasks, PACE raises the average success rate from 57.8% to 64.2%. In real-robot experiments on bimanual ALOHA and single-arm Franka platforms, PACE improves the average task score from 60.7 to 77.7 and the average success rate from 50.7% to 70.4%. Ablations and rollout-level analyses show that PACE adapts execution horizons across manipulation phases, shortening near transitions while preserving longer execution during coherent motion.
comment: 21 pages, 7 figures, 6 tables. Preprint
DriveAnchor: Progressive Anchor-based Flow Learning for Autonomous Driving Planning
We present DriveAnchor, a three-stage framework for autonomous driving planning that achieves behavioral diversity, controllability, and safety in a composable pipeline. Demonstration Flow Pretraining replaces the unstructured Gaussian prior with a vocabulary of 2,398 trajectory shapes constructed by farthest-point sampling, structurally grounding behavioral diversity in vocabulary coverage. Guided Flow Post-training jointly post-trains an Energy Field module with flow matching (FM), conditioning the Energy Field on static road geometry alone, to relocate anchors toward user-specified corridor polygons before flow generation, adding controllability without differentiable guidance; after Stage 2, new corridor presets require only Energy Field updates, not FM retraining. Reward-Refined Flow Fine-tuning applies zeroth-order reinforcement learning to align each anchor's output with collision-avoidance objectives: because the flow-matching model is a deterministic feedforward network in single-step mode, each anchor uniquely determines the output trajectory, reducing reward optimization to a direction search in anchor space without log-likelihood computation or ODE-to-SDE conversion. Evaluated on approximately 2 million held-out driving scenarios, DriveAnchor reduces near-range collision rates by 89% and improves mean reward by 32% without degradation in imitation accuracy, with 2.06 ms inference on NVIDIA Drive Orin. DriveAnchor has been validated through real-world vehicle testing, confirming its practicality for production deployment.
PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation
Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a passivity-shielded compliance prior that recasts the VLA interface. Rather than trusting VLAs with direct motor commands, PaCo-VLA treats network outputs as task-level compliance proposals: semantic bindings, task stages, and admittance schedules. A high-frequency, proposal-independent passivity shield governs these proposals through energy-tank accounting and boundary checks, preventing invalid, stale, or unverified model predictions from bypassing low-level contact physics. This decoupled architecture also enables causal evaluation, isolating semantic contributions from geometric shortcuts. Extensive simulated and real-world connector-insertion experiments demonstrate that PaCo-VLA achieves superior precision over unshielded VLA baselines, sustaining zero passivity violations even under adversarial compliance shifts. This framework establishes a provably sampled-passive runtime contract at the admittance port and provides a runtime interface for deploying foundation models in contact-rich domains.
comment: Under review, code will be available soon
A passive universal grasping mechanism based on an everting shell
A passive monolithic compliant grasping mechanism that works based on the eversion of an elastically deformable bistable shell is conceptualized. It comprises grasping arms made of beam segments that work in conjunction with the everting shell. The grasper is capable of picking up a stiff object of any shape up to a maximum size and weight. The bistable shell everts upon contact with the object to enable the grasping arms envelop the object forming an enclosure. The mechanism then stays in that configuration until it is actuated again to turn the shell back to its original configuration and thereby opening the enclosure to release the object. The stiffness of the arms decides the payload of the mechanism. The size of the arms decides the largest object that can be grasped and held. The arms have distributed compliance so that they can conform to the shape of the object without applying undue force on it.
Adaptive PD Gains for Energy-Conscious Control in Physical Human-Robot Interaction
Compliant force or torque control are approaches often investigated to achieve safe physical human-robot interaction (pHRI). However, these approaches have limitations. Force control requires a robot to be equipped with external force sensors to track the amplitude and direction of applied forces. Torque control requires torque sensing or estimation in each joint. As this is not available on every robot, energy-based approaches offer a promising alternative. Such approaches aim to achieve safe pHRI by limiting the mechanical energy of the robot. Current schemes leveraging an energy-based approach tend to have a complex implementation, and some may require further stability verification. We hence propose an adaptive proportional-derivative (PD) controller that can limit a robot's energy under any given limit to achieve safe pHRI. The proposed controller can limit both the kinetic and potential energy of a robot, and the behaviour of the controller gains can be shaped using various parameters, defining precisely the cutoff limit and sharpness. We construct a stability proof for the controller and define a condition to ensure the controller's stability. The proposed controller's behaviour and compliance are tested on the TALOS robot from PAL Robotics both in simulation and on hardware, verifying the expected compliant and energy-limiting behaviour of the controller.
ROG-Grasp: Root-Oriented Geometry for Robotic Grasping and Placement
Orientation-aware manipulation is essential in post-harvest agricultural processing, where produce must be grasped and placed in consistent configurations. This paper presents ROG-Grasp, a geometry-based robotic grasping and placement framework that estimates the produce orientation from root surface geometry using RGB-D perception. A YOLO-based root detector and point cloud plane fitting are used to infer the root normal, enabling stable grasp pose generation and orientation-constrained Cartesian motion planning. Experiments on tomatoes and onions demonstrate high success rates and stable execution time in both isolated and cluttered scenarios. Compared with vision-language-action (VLA) policies, the proposed method achieves more reliable and accurate grasp completion with faster execution. These results highlight the effectiveness of geometry-driven perception for practical orientation-controlled manipulation tasks. A video of our paper is available online https://youtu.be/Ir2UtGODdMo.
comment: Comments: 7 pages, 6 figures. Video: https://youtu.be/Ir2UtGODdMo
Scalar-Measurement Attitude Estimation on $\mathbf{SO}(3)$ with Bias Compensation ICRA 2026
Attitude estimation methods typically rely on full vector measurements from inertial sensors such as accelerometers and magnetometers. This paper shows that reliable estimation can also be achieved using only scalar measurements, which naturally arise either as components of vector readings or as independent constraints from other sensing modalities. We propose nonlinear deterministic observers on $\mathbf{SO}(3)$ that incorporate gyroscope bias compensation and guarantee uniform local exponential stability under suitable observability conditions. A key feature of the framework is its robustness to partial sensing: accurate estimation is maintained even when only a subset of vector components is available. Experimental validation on the BROAD dataset confirms consistent performance across progressively reduced measurement configurations, with estimation errors remaining small even under severe information loss. To the best of our knowledge, this is the first work to establish fundamental observability results showing that two scalar measurements under suitable excitation suffice for attitude estimation, and that three are enough in the static case. These results position scalar-measurement-based observers as a practical and reliable alternative to conventional vector-based approaches.
comment: 9 pages, 4 figures. Accepted to ICRA 2026
Situation-Aware Interactive MPC Switching for Autonomous Driving
Autonomous driving in interactive traffic scenarios remains challenging because of the mutual influence among vehicles and the inherent uncertainty of surrounding agents. Several model predictive control (MPC) formulations have been proposed to address this challenge, each adopting a different model of inter-agent interaction. While higher-fidelity interaction models enable more intelligent behavior, they incur substantially greater computational cost. Since strong interactions arise only occasionally in real traffic, a practical strategy for balancing performance and computational overhead is to invoke an appropriate controller based on situational demands. To this end, we first conduct a comparative study to assess and hierarchize the interactive capabilities of different MPC formulations. Building on this hierarchy, we then develop a neural network-based classifier for situation-aware switching among these controllers. We demonstrate that, by invoking the most advanced interactive MPC only in rare but critical situations and relying on a basic MPC in the majority of situations, situation-aware switching substantially improves overall performance while significantly reducing computational load.
Semantic-Geometric Task Representations for Bimanual Manipulation from Human Demonstrations to Robot Action Planning
Learning structured task representations from human demonstrations is essential for bimanual manipulation, where action ordering, object involvement, and interaction geometry vary significantly across executions. A key challenge lies in jointly capturing the discrete semantic task structure and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. We introduce a semantic--geometric graph-based task representation that jointly encodes object identities, inter-object semantic relations, and per-object motion histories, via a Message Passing Neural Network (MPNN) encoder and a Transformer-based decoder. The encoder operates solely on the temporal scene graph, producing structured representations decoupled from action labels. The decoder then conditions on action-context to forecast future actions, associated objects, and object motions. This decoupling learns task-agnostic representations, enabling encoder reuse across embodiments through decoder-only finetuning on a small robot dataset. Across eleven bimanual tasks from two datasets, we find that the benefit of structured semantic--geometric representations over simpler sequence-based models grows with task variability in action ordering and object involvement. At deployment, a planner couples the action and motion predictions with learned Probabilistic Movement Primitives, achieving full task success on two real-robot bimanual tasks and outperforming graph ablations, Transformer, decoder-only, and finetuned vision-language model baselines.
comment: 9 pages, 7 figures, preprint
Hybrid TD3: Overestimation Bias Analysis and Stable Policy Optimization for Hybrid Action Space
Reinforcement learning in discrete-continuous hybrid action spaces presents fundamental challenges for robotic manipulation, where high-level task decisions and low-level joint-space execution must be jointly optimized. Existing approaches either discretize continuous components or relax discrete choices into continuous approximations, which suffer from scalability limitations and training instability in high-dimensional action spaces and under domain randomization. In this paper, we propose Hybrid TD3, an extension of Twin Delayed Deep Deterministic Policy Gradient (TD3) that natively handles parameterized hybrid action spaces in a principled manner. We conduct a rigorous theoretical analysis of overestimation bias in hybrid action settings, deriving formal bounds under twin-critic architectures and establishing a complete bias ordering across five algorithmic variants under synchronized Gaussian error assumptions. Building on this analysis, we introduce a weighted clipped Q-learning target that marginalizes over the discrete action distribution, achieving equivalent bias reduction to standard clipped minimization while improving policy smoothness. Experimental results demonstrate that Hybrid TD3 achieves superior training stability and competitive performance against state-of-the-art hybrid action baselines.
ShelfAware: Real-Time Semantic Localization in Quasi-Static Environments with Low-Cost Sensors
Many indoor workspaces are quasi-static: their global geometric layout is stable, but local semantics change continually, producing repetitive geometry, dynamic clutter, and perceptual noise that defeat standard vision-based localization. We present ShelfAware, a semantic particle filter for robust global localization that treats scene semantics as statistical evidence over object categories rather than fixed quantity landmarks. ShelfAware fuses a depth likelihood with a category-centric semantic similarity and uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside Monte Carlo Localization (MCL), yielding fast, targeted hypothesis generation on low-cost, vision-only hardware. To demonstrate perception-agnostic scalability, we evaluate ShelfAware across two domains. In a rigorously controlled mock retail environment, ShelfAware achieves a 97% global localization success rate, maintaining the highest tracking success (66%) across cart, wearable, and dynamic occlusion conditions. Furthermore, in a 3,500 sq. ft. operational grocery store leveraging an open-vocabulary vision pipeline, ShelfAware significantly outperforms both geometric and fixed-quantity semantic baselines. By modeling semantics distributionally and leveraging inverse proposals, ShelfAware resolves geometric aliasing, providing an infrastructure-free building block for mobile and assistive robots in dynamic real-world environments.
comment: 8 pages
See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation CVPR
Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce \textbf{S}ee, \textbf{P}lan, \textbf{R}ewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle, Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework's effectiveness, generalization and robustness: SPR outperforms the MolmoAct baseline by 5\% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark with unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA, demonstrating superior out-of-distribution robustness.
comment: Suggested to CVPR Findings. https://tingjundai.github.io/SPRVLA/
SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes ICML 2026
Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages$\unicode{x2013}$from architectural layout to furniture placement to small object population$\unicode{x2013}$each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.
comment: ICML 2026 Spotlight; Project page: https://scenesmith.github.io/
RoboBenchMart: Benchmarking Robots in Retail Environment
Most existing robotic manipulation benchmarks focus on tabletop or household scenarios. While these setups have driven impressive progress, it remains unclear whether generalist VLAs that excel there can truly generalize to domains with different geometry, semantics, and workflows. We introduce RoboBenchMart, an open-source simulated benchmark targeting retail dark-store environments, where a mobile manipulator must perform complex manipulation tasks with diverse grocery items. This setting presents significant challenges, including dense object clutter and varied spatial configurations, with items positioned at different heights, depths, and in close proximity. By targeting on the retail domain, our benchmark addresses a setting with strong potential for near-term automation impact. Using generated trajectories, we model a standard, realistic fine-tuning setup for current generalist VLAs and evaluate several state-of-the-art models. We find that they still struggle even on common retail tasks, indicating that these models are not yet truly general across domains. To support further research, we release the RoboBenchMart suite, which includes a procedural store layout generator, a trajectory generation pipeline, evaluation tools, and fine-tuned baseline models.
Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction
At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing with high-level semantic understanding. We present Genie 4D, a framework that turns hand-held phone capture into a semantically grounded, action-controllable 4D world model. Genie 4D couples a real-time visual-inertial Gaussian splatting front-end for metric geometry with a feed-forward 4D backbone regularized by frozen DINOv3 features acting as structural priors. The semantic priors suppress identity drift during dynamic tracking, while a short conditional diffusion refiner recovers high-frequency surface detail that regression backbones smooth away. Finally, a lightweight latent-action head exposes the reconstructed 4D state to a Genie-style world model trained with a JEPA-style next-embedding objective, so that the scene can be rolled forward under user actions. On the Point Odyssey and TUM-Dynamics benchmarks, Genie 4D retains the linear time complexity O(T) of feed-forward baselines while improving 3D tracking accuracy (APD) and reconstruction completeness, and it runs interactively on a single consumer GPU (RTX 5090) from iPhone, Mac, Windows, and Linux capture clients. Genie 4D offers a practical, semantic-prior-guided path toward physically grounded world models.
LeARN: Learnable and Adaptive Representations for Nonlinear Dynamics in System Identification
System identification, the process of deriving mathematical models of dynamical systems from observed input-output data, has undergone a paradigm shift with the advent of learning-based methods. Addressing the intricate challenges of data-driven discovery in nonlinear dynamical systems, these methods have garnered significant attention. Among them, Sparse Identification of Nonlinear Dynamics (SINDy) has emerged as a transformative approach, distilling complex dynamical behaviors into interpretable linear combinations of basis functions. However, SINDy's reliance on domain-specific expertise to construct its foundational 'library' of basis functions limits its adaptability and universality. In this work, we introduce a nonlinear system identification framework LeARN that transcends the need for prior domain knowledge by learning the library of basis functions directly from data. To enhance adaptability to evolving system dynamics under varying noise conditions, we employ a novel meta-learning-based system identification approach that utilizes a light-weight Deep Neural Network (DNN) to dynamically refine these basis functions. This not only captures intricate system behaviors but also adapts effectively to new dynamical regimes. We validate our framework on the Neural Fly dataset, showcasing its robust adaptation and generalization capabilities. Despite its simplicity, our LeARN achieves competitive dynamical error performance to SINDy. This work presents a step towards autonomous discovery of dynamical systems, paving the way for a future where machine learning uncovers the governing principles of complex systems without requiring extensive domain-specific interventions.
comment: This work has been accepted at the 34th Mediterranean Conference on Control and Automation (MED 2026)
A Unified Framework for Probabilistic Dynamic-, Trajectory- and Vision-based Virtual Fixtures
Probabilistic Virtual Fixtures (VFs) enable the adaptive selection of the most suitable haptic feedback for each phase of a task, based on learned or perceived uncertainty. While keeping the human in the loop remains essential, for instance, to ensure high precision, partial automation of certain task phases is critical for productivity. We present a unified framework for probabilistic VFs that seamlessly switches between manual fixtures, semi-automated fixtures (with the human handling precise tasks), and full autonomy. We introduce a novel probabilistic Dynamical System-based VF for coarse guidance, enabling the robot to autonomously complete certain task phases while keeping the human operator in the loop. For tasks requiring precise guidance, we extend probabilistic position-based trajectory fixtures with automation, allowing for seamless human interaction, geometry-awareness and optimal impedance gains. For manual tasks requiring very precise guidance, we also extend visual servoing fixtures with the same geometry-awareness and impedance behavior. We validate our approach on different robots, including an evaluation with expert users, showcasing operation modes, the ease of programming fixtures and lower interaction forces and favorable usability compared to a baseline.
comment: for the supplementary video, see https://youtu.be/eMl41ha7VJ4
Proactive-reactive detection and mitigation of intermittent faults in robot swarms
Intermittent faults are transient errors that sporadically appear and disappear. Although intermittent faults pose substantial challenges to reliability and coordination, existing studies of fault tolerance in robot swarms focus instead on permanent faults. One reason for this is that intermittent faults are prohibitively difficult to detect in the fully self-organized ad-hoc networks typical of robot swarms, as their network topologies are transient and often unpredictable. However, in the recently introduced self-organizing nervous systems (SoNS) approach, robot swarms are able to self-organize persistent network structures for the first time, easing the problem of detecting intermittent faults. To address intermittent faults in robot swarms that have persistent networks, we propose a novel proactive-reactive strategy to detection and mitigation, based on self-organized backup layers and distributed consensus in a multiplex network. Proactively, the robots self-organize dynamic backup paths before faults occur, adapting to changes in the primary network topology and the robots' relative positions. Reactively, robots use one-shot likelihood ratio tests to compare information received along different paths in the multiplex network, enabling early fault detection. Upon detection, communication is temporarily rerouted in a self-organized way, until the detected fault resolves. We validate the approach in representative scenarios of faulty positional data occurring during formation control, demonstrating that intermittent faults are prevented from disrupting convergence to desired formations, with high fault detection accuracy and low rates of false positives.
3D RL-DWA: A Hybrid Reinforcement Learning and Dynamic Window Approach for Goal-Directed Local Navigation in Multi-DoF Robots
In this paper, we present a novel hybrid approach that combines Reinforcement Learning (RL) with Dynamic Window Approach (DWA) for adaptive 3D local navigation of high-degree-of-freedom robotic systems. Our method leverages sparse point cloud data to dynamically adjust both the motion and the shape of a deformable microrobot, enabling the system to navigate toward a goal in complex, constrained environments while maximizing the occupied volume. We evaluate our framework in a simulated vascular network. Experimental results, based on 1080 trials, indicate that integrating RL with a DWA-based local planner significantly enhances both deformation and navigation capabilities compared to pure RL and model-based methods. In particular, the proposed autonomous controller consistently achieves high deformation and near-perfect path completion during training and maintains robust performance in unseen scenarios. These findings highlight the potential of hybrid planning strategies for efficient and adaptive 3D navigation under sparse sensory conditions.
comment: Accepted for publication in the Proceedings of the IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM 2026)
GIFT: Geometry-Induced Functional Transfer for Category-level Object Manipulation ICRA 2026
Robotic manipulation of unfamiliar objects in new environments is challenging due to limited generalisation capabilities. We propose a new skill transfer framework, GIFT (Geometry-Induced Functional Transfer), which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration. Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object-centric interactions. By leveraging the Functional Maps (FMC) framework, we efficiently map interaction functions between objects and their environments, allowing the robot to replicate task operations across objects of similar topologies or categories, even when they have significantly different shapes. Additionally, our method incorporates screw interpolation (ScLERP) for generating smooth, geometrically-aware robot paths to ensure the transferred skills adhere to the demonstrated task constraints. We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real-world environments without requiring additional training.
comment: 8 pages, 6 figures. ICRA 2026
AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence
Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric variations due to limited data diversity. Leveraging powerful 3D generative models and vision foundation models (VFMs), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly improving data efficiency in robot learning. Project Page: https://jiaweiz9.github.io/AffordGen-release/
DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning ICRA 2026
Dual-arm robots promise greater efficiency but require planning for complex tasks with nonlinear sub-task dependencies. Current methods using Large Language Models (LLMs) suffer from a fundamental trade-off: generating linear sequences is efficient but fails to model parallelism and adapt to changes, while iterative querying is adaptive but too slow and costly. To bridge this gap, we introduce DAG-Plan, a novel task planning framework that for the first time employs a Directed Acyclic Graph (DAG) as the central representation for dual-arm coordination. The key insight is that a DAG natively captures complex sub-task dependencies and explicitly reveals opportunities for parallel execution. Within this framework, an LLM is used only once as a powerful semantic parser to translate a natural language instruction into a structured DAG. During execution, our system dynamically assigns candidate nodes to the suitable arm based on real-time environmental observations, enabling truly adaptive and parallel operation. Extensive evaluation on a dual-arm kitchen benchmark shows that DAG-Plan's structured approach fundamentally outperforms existing paradigms. It achieves a 48% higher success rate than single-query linear sequence methods with dual arm by robustly managing dependencies, and an 84.1% higher execution efficiency than iterative querying methods by eliminating the latency of repeated LLM calls. Our work demonstrates that a principled, graph-based representation is the key to unlocking efficient and reliable LLM-based planning for complex robotic systems. More demos and code are available on https://sites.google.com/view/dag-plan.
comment: ICRA 2026
MARFT: Multi-Agent Reinforcement Fine-Tuning
Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to scientific research. Meanwhile, Reinforcement Learning (RL) is widely recognized for enhancing agent intelligence, but limited work has studied fine-tuning LaMAS with foundational RL techniques. Directly applying conventional Multi-Agent Reinforcement Learning (MARL) to LaMAS also introduces major challenges due to the unique mechanisms of LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce Flex-MG, a new Markov Game formulation aligned with real-world LaMAS optimization, together with a universal algorithmic framework tailored to LaMAS. We review the evolution from traditional RL to Reinforcement Fine-Tuning (RFT), then analyze the multi-agent counterpart. For LaMAS, we identify key differences between classical MARL and MARFT, including asynchronous agent interactions, profile-aware agent design, and heterogeneous architectures. These differences motivate a LaMAS-oriented formulation of RFT. We present a robust and scalable MARFT framework, detail its modular algorithm, and provide an open-source implementation to support adoption and further research. The paper further discusses application perspectives and open challenges, including dynamic environment modeling, sample inefficiency, and the lack of cohesive frameworks. By connecting theoretical foundations with practical methodology, this work aims to serve as a roadmap for advancing MARFT toward resilient, adaptive, and human-aligned agentic systems. Implementation: https://github.com/jwliao-ai/MARFT.
comment: 37 pages
Highly Deformable Proprioceptive Membrane for Real-Time 3D Shape Reconstruction
Reconstructing the three-dimensional (3D) geometry of object surfaces is essential for robot perception, yet vision-based approaches degrade under low illumination or occlusion. This limitation motivates the design of a proprioceptive membrane that conforms to the surface of interest and infers 3D geometry by reconstructing its own deformation. Conventional deformation-aware membranes typically rely on resistive, capacitive, or magneto-sensitive mechanisms, but can suffer from structural complexity, limited compliance during large-scale deformation, and susceptibility to electromagnetic interference. This work presents a soft, flexible, and stretchable proprioceptive silicone membrane based on optical waveguide sensing. The membrane integrates edge-mounted LEDs and centrally-distributed photodiodes (PDs) within a multilayer elastomeric composite. Rich deformation-dependent light-intensity signals are decoded by a data-driven model to recover the membrane geometry. Real-time reconstruction is demonstrated on a customized 140 mm square membrane at an end-to-end update rate of 90 Hz, achieving an average reconstruction error of 1.307 mm for out-of-plane deformation of up to 25 mm. The proposed sensor also demonstrates accurate reconstruction under large in-plane deformation, achieving reliable shape recovery up to 75% strain with an average Chamfer distance of 1.214 mm. The proposed framework provides a scalable, robust, and low-profile solution for global shape perception in deformable robotic systems.
comment: 13 pages, 9 figures
Approximate Imitation Learning for Event-based Quadrotor Flight in Cluttered Environments
Event cameras offer high temporal resolution and low latency, making them ideal sensors for high-speed robotic applications where conventional cameras suffer from motion blur. However, their widespread adoption in robot learning is severely bottlenecked by the computational cost of simulating high-frequency event data during online training. In this work, we present Approximate Imitation Learning, a novel framework that fundamentally resolves this bottleneck, reducing policy training time for complex, agile drone flight from 52.44 hours to just 1.86 hours - a 28x computational speedup. Our key insight is to separate representation learning from policy search. We first leverage a large-scale offline dataset to learn a task-specific representation space. Subsequently, the policy is fine-tuned through online interactions that rely solely on lightweight state information, completely eliminating the need to render events during the active policy search phase. This training paradigm drastically reduces development overhead and enables event-based control policies to scale to complex environments. Furthermore, our approach eliminates the reliance on standard cameras or intermediate representations during deployment, mapping events directly to control commands. In simulation, our method matches or exceeds the performance of standard imitation learning baselines that require full online event rendering. Finally, we successfully validate the framework in the real world, demonstrating that a policy trained via this ultra-efficient paradigm enables a quadrotor to fly through highly cluttered environments at remarkable speeds of up to 9.8 m/s.
RynnVLA-002: A Unified Vision-Language-Action and World Model
We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and supporting the world model's image generation. The unified framework of RynnVLA-002 enables joint learning of environmental dynamics and action planning. Our experiments show that RynnVLA-002 surpasses individual VLA and world models, demonstrating their mutual enhancement. We evaluate RynnVLA-002 in both simulation and real-world robot tasks. RynnVLA-002 achieves 97.4% success rate on the LIBERO simulation benchmark without pretraining, while in real-world LeRobot experiments, its integrated world model boosts the overall success rate by 50%.
RCM-ACT: Imitation Learning with Dynamic RCM Calibration for Autonomous Intraocular Foreign Body Removal
Intraocular foreign body removal demands millimeter-level precision in confined intraocular spaces, yet existing robotic systems predominantly rely on manual teleoperation with steep learning curves. To address the challenges of autonomous manipulation, particularly kinematic uncertainties from variable motion scaling and Remote Center of Motion (RCM) point variation, we propose RCM-ACT, an imitation learning framework for autonomous intraocular foreign body ring manipulation. Our approach integrates RCM dynamic calibration to resolve coordinate system inconsistencies caused by intraocular instrument variation and introduces the RCM-ACT architecture, which combines action chunking transformers with episode-level kinematic realignment. Trained solely on stereo visual data and instrument kinematics from expert demonstrations in an artificial eye model, RCM-ACT successfully completes ring grasping and positioning tasks without explicit depth sensing. Experimental validation demonstrates the successful implementation of end-to-end autonomy under uncalibrated microscopy conditions, achieving a mean 3-D Euclidean grasp deviation of 0.686 mm and 11/20 full-task successes. The results provide a viable framework for developing intelligent eye surgical systems capable of complex intraocular procedures.
TRANS: Terrain-aware Reinforcement Learning for Agile Navigation of Quadruped Robots under Social Interactions
This study introduces TRANS: Terrain-aware Reinforcement learning for Agile Navigation under Social interactions, a deep reinforcement learning (DRL) framework for quadrupedal social navigation over unstructured terrains. Conventional quadrupedal navigation typically separates motion planning from locomotion control, neglecting whole-body constraints and terrain awareness. On the other hand, end-to-end methods are more integrated but require high-frequency sensing, which is often noisy and computationally costly. In addition, most existing approaches assume static environments, limiting their use in human-populated settings. To address these limitations, we propose a two-stage training framework with three DRL pipelines. (1) TRANS-Loco employs an asymmetric actor-critic (AC) model for quadrupedal locomotion, enabling traversal of uneven terrains without explicit terrain or contact observations. (2) TRANS-Nav applies a symmetric AC framework for social navigation, directly mapping transformed LiDAR data to ego-agent actions under differential-drive kinematics. (3) A unified pipeline, TRANS, integrates TRANS-Loco and TRANS-Nav, supporting terrain-aware quadrupedal navigation in uneven and socially interactive environments. Comprehensive benchmarks against locomotion and social navigation baselines demonstrate the effectiveness of TRANS. Hardware experiments further confirm its potential for sim-to-real transfer.
Provably Safe Motion Planning Under Unknown Disturbances
We present a provably safe sampling-based motion planning algorithm for robotic systems affected by random disturbances of unknown distribution. We consider systems with linear or linearizable dynamics evolving in workspace with arbitrary-shaped obstacles subject to state and control constraints. Safety requirements are formulated as chance-constraints. Our approach leverages data from trajectories of the system to learn a Wasserstein ambiguity tube, i.e., a sequence of ambiguity sets, which contains the trajectory of the system's state distribution with high confidence. This ambiguity tube is then used in a probabilistically complete algorithm to grow a sampling-based motion planning tree that respects the constraints of the problem. We show that learning several lower-dimensional ambiguity tubes instead of a single high-dimensional one effectively reduces the conservatism and boosts scalability. Additionally, we design an efficient bandit-based validity checker that remarkably increases the empirical performance of our approach without sacrificing probabilistic completeness. Case studies show our algorithm finds valid plans in cluttered environments under strict safety thresholds, outperforming state-of-the-art methods.
Multiagent Systems
SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct grounded 4,853 question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that can answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.
comment: 34 pages, 21 figures, 5 tables
Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems
Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud-validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre-registered exact-winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near-best routing claim is strongly supported. In every pre-registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality-score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre-registered Kendall's W test finds no reliable difference between Vietnamese-domain and English-domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed-rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner-selection law.
comment: 13 pages, 4 appendix
Scaling Behavior of Single LLM-Driven Multi-Agent Systems
The burgeoning field of LLM-based Multi-Agent Systems (MAS) promises to tackle complex tasks through collaborative intelligence, yet fundamental questions regarding their scaling behavior and intrinsic collective dynamics remain underexplored. This paper systematically investigates how the performance of a homogeneous MAS evolves as the number of agents increases, isolating the variable of collaboration from model or knowledge heterogeneity. We propose the Sequential Iterative Multi-Agent System (SIMAS) framework, a minimalist architecture centered on sequential inter-agent communication, to clearly observe scaling effects. Through extensive experiments across diverse tasks and model scales, we establish that MAS performance does not scale monotonically with agent count but follows a pattern of diminishing returns, governed by a trade-off between collaborative synergy and coordination overhead. Our findings reveal that effective MAS requires a sufficiently capable base LLM, that task type critically modulates the optimal agent count, and that collective intelligence is an emergent property contingent on strategic interaction design rather than a guaranteed outcome of agent plurality. The performance degradation stems coordination overhead rather than merely long-context failure, and the scaling tendency generalizes across interaction architectures like structured debate topologies. This work provides a foundational understanding of MAS scaling laws, offering practical guidance for designing efficient collaborative systems and challenging the prevailing assumption that more agents invariably lead to better performance.
MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation KDD 2026
Retrieval-Augmented Generation (RAG) has become an essential method for mitigating hallucinations in Large Language Models (LLMs) by leveraging external knowledge. Although effective for simple queries, traditional RAG struggles with large-scale, unstructured corpora where information is highly fragmented. Graph-based RAG (GraphRAG) incorporates knowledge graphs to capture structural relationships, enabling more comprehensive retrieval for complex reasoning. However, existing GraphRAG methods rely on isolated, fragment-level extraction for graph construction, lacking a global perspective on the whole corpus. As a result, these methods frequently lead to thematically inconsistent, logically conflicting, and structurally fragmented graphs that degrade retrieval performance. In this paper, we propose MemGraphRAG, a novel framework that introduces a memory-based multi-agent system to ensure high-quality graph construction. Specifically, MemGraphRAG employs a collaborative society of agents supported by shared memory, which provides a unified global context throughout the extraction process. This mechanism allows agents to dynamically resolve logical conflicts and maintain structural connectivity throughout the corpus. Furthermore, we propose a memory-aware hierarchical retrieval algorithm tailored for the constructed graph. Extensive experiments on multiple benchmarks demonstrate that MemGraphRAG outperforms the state-of-the-art baseline models with comparable efficiency. Our code is available at https://github.com/XMUDeepLIT/MemGraphRAG.
comment: Accepted by KDD 2026
State Machine Guided Multi-Relational Synthetic Data from Logs for Anomaly Detection
Software systems generate massive unstructured logs that record execution behavior, failures, and interactions across components, yet existing log anomaly detection methods treat these logs primarily as flat sequences of templates, overlooking the relational execution structure that governs how events co-occur and evolve over time. We propose a framework that discovers this hidden structure by recovering an execution state machine directly from logs and inducing a corresponding multi-table relational schema connecting traces, events, states, transitions, and parameters. This discovered state machine serves as a generative prior to produce realistic multi-relational synthetic data that preserves structural, temporal, and process constraints while amplifying rare but valid execution behaviors. We assess the fidelity of the generated data through constraint validation, distributional similarity, and process-level metrics, and demonstrate its usefulness by showing that augmenting real logs with the synthetic relational data significantly improves anomaly and bug detection on held-out real datasets compared to sequence-based baselines and naive oversampling. Our results show that execution logs implicitly encode a relational database governed by a latent state machine, and that recovering this structure enables principled synthetic data generation for robust and interpretable anomaly detection.
Proactive-reactive detection and mitigation of intermittent faults in robot swarms
Intermittent faults are transient errors that sporadically appear and disappear. Although intermittent faults pose substantial challenges to reliability and coordination, existing studies of fault tolerance in robot swarms focus instead on permanent faults. One reason for this is that intermittent faults are prohibitively difficult to detect in the fully self-organized ad-hoc networks typical of robot swarms, as their network topologies are transient and often unpredictable. However, in the recently introduced self-organizing nervous systems (SoNS) approach, robot swarms are able to self-organize persistent network structures for the first time, easing the problem of detecting intermittent faults. To address intermittent faults in robot swarms that have persistent networks, we propose a novel proactive-reactive strategy to detection and mitigation, based on self-organized backup layers and distributed consensus in a multiplex network. Proactively, the robots self-organize dynamic backup paths before faults occur, adapting to changes in the primary network topology and the robots' relative positions. Reactively, robots use one-shot likelihood ratio tests to compare information received along different paths in the multiplex network, enabling early fault detection. Upon detection, communication is temporarily rerouted in a self-organized way, until the detected fault resolves. We validate the approach in representative scenarios of faulty positional data occurring during formation control, demonstrating that intermittent faults are prevented from disrupting convergence to desired formations, with high fault detection accuracy and low rates of false positives.
MARFT: Multi-Agent Reinforcement Fine-Tuning
Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to scientific research. Meanwhile, Reinforcement Learning (RL) is widely recognized for enhancing agent intelligence, but limited work has studied fine-tuning LaMAS with foundational RL techniques. Directly applying conventional Multi-Agent Reinforcement Learning (MARL) to LaMAS also introduces major challenges due to the unique mechanisms of LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce Flex-MG, a new Markov Game formulation aligned with real-world LaMAS optimization, together with a universal algorithmic framework tailored to LaMAS. We review the evolution from traditional RL to Reinforcement Fine-Tuning (RFT), then analyze the multi-agent counterpart. For LaMAS, we identify key differences between classical MARL and MARFT, including asynchronous agent interactions, profile-aware agent design, and heterogeneous architectures. These differences motivate a LaMAS-oriented formulation of RFT. We present a robust and scalable MARFT framework, detail its modular algorithm, and provide an open-source implementation to support adoption and further research. The paper further discusses application perspectives and open challenges, including dynamic environment modeling, sample inefficiency, and the lack of cohesive frameworks. By connecting theoretical foundations with practical methodology, this work aims to serve as a roadmap for advancing MARFT toward resilient, adaptive, and human-aligned agentic systems. Implementation: https://github.com/jwliao-ai/MARFT.
comment: 37 pages
Stop Wandering, Find the Keys: LLMs Discriminate Key States for Efficient Multi-Agent Exploration
With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although pursuing novelty, diversity, or uncertainty attracts increasing attention, redundant efforts brought by exploration without proper guidance choices poses a practical issue for the community. This paper introduces a systematic approach, termed LEMAE, choosing to channel informative task-relevant guidance from a knowledgeable Large Language Model (LLM) for Efficient Multi-Agent Exploration. Specifically, we ground linguistic knowledge from LLM into symbolic key states, that are critical for task fulfillment, in a discriminative manner at low LLM inference costs. To unleash the power of key states, we design Subspace-based Hindsight Intrinsic Reward (SHIR) to guide agents toward key states by increasing reward density. Additionally, we build the Key State Memory Tree (KSMT) to track transitions between key states in a specific task for organized exploration. Benefiting from diminishing redundant explorations, LEMAE outperforms existing SOTA approaches on the challenging benchmarks (e.g., SMAC and MPE) by a large margin, achieving a 10x acceleration in certain scenarios.
When Identity Overrides Incentives: Representational Choices as Governance Decisions in Multi-Agent LLM Systems
Multi-agent systems built on large language models are increasingly deployed in strategic policy and governance settings, where agents representing stakeholders with conflicting interests must coordinate under shared constraints. These systems typically assign role-based personas to agents, describing their motivations and objectives. Whether agents with role-based identities follow explicit payoffs or their assigned roles in strategic decision-making remains untested. Here we show that assigning role-based personas suppresses payoff-aligned behavior in four-agent strategic games, shifting equilibrium attainment by up to 90 percentage points even when agents have complete payoff information. We test a 2x2 factorial design (persona presence x payoff visibility) across four models (Qwen-7B, Qwen-32B, Llama-8B, Mistral-7B), and 53 environmental policy scenarios with two equilibria: Tragedy of the Commons, where individual payoff dominates, and Green Transition, where collective payoff dominates. With personas present, all models reach near-zero Tragedy equilibrium in the Tragedy-dominant scenarios despite complete payoff information, and 100% of equilibria correspond to Green Transition. No model reaches Tragedy equilibrium by removing personas alone; only Qwen models reach 65-90% Tragedy equilibrium rates when personas are removed, and payoffs are made explicit. Three distinct behavioral profiles emerge: Qwen shifts equilibrium selection based on framing condition, Mistral increases response variance without reaching the Tragedy equilibrium, and Llama holds near-constant across all conditions. Representational choices in multi-agent LLM systems are governance decisions: persona assignment determines which equilibrium a simulation produces, independent of the underlying incentive structure.
comment: Accepted to ACM FAccT 2026
Systems and Control (EESS)
Lipschitz-Enforced Machine Learning Framework for Accelerating Transient Stability Analysis of Networked Grid-Interactive Inverters
The growing penetration of grid-connected inverters renders Transient Stability Analysis (TSA) increasingly challenging in modern power systems. Existing TSA methodologies encounter an intrinsic trade-off between accuracy and scalability when dealing with these networked inverter-based resources (IBRs). To bridge this gap, this paper proposes a Lipschitz-enforced machine learning framework that leverages Lipschitz continuity to restructure the transient stability certification mechanism. By replacing computationally intensive verification procedures with a deterministic and efficient algebraic check, the proposed method enables rigorous stability guarantees for complex multi-inverter systems, effectively bypassing the complexity limits of traditional analytical approximations. Validated on networked Grid-Forming (GFM) inverter systems, the proposed framework accelerates the training process by over 5 times compared to existing methods. Notably, the proposed framework substantially outperforms traditional transient stability analysis approaches (e.g., Linear Matrix Inequality and Sum-of-Squares methods) by capturing up to 30\% larger Regions of Attraction (ROA), effectively shattering the conservativeness bottleneck that has long constrained traditional analytical tools. This advancement provides a scalable and theoretically rigorous solution for the TSA of networked IBRs in modern power grids.
comment: 11 pages, 9 figures. Submitted to IEEE Transactions on Power Electronics
A Framework for Motion Planning with Temporal Logic Precedence Specifications via Augmented Graphs of Convex Sets
We present a framework for planning trajectories that avoid obstacles and satisfy logical precedence constraints expressed with a fragment of signal temporal logic (STL). Our approach models environments containing obstacles, keys, and doors, where collecting a key unlocks its associated door and potentially opens shorter paths to a goal. Based on an exact convex partitioning of the free space that encodes connectivity among convex free space, key, and door regions, we construct an augmented graph of convex sets (GCS) whose layered structure exactly encodes the key-door precedence logic. A shortest path in the augmented GCS simultaneously selects an optimal key collection sequence and computes an optimal continuous trajectory, providing an exact solution up to a finite Bézier curve parameterization.
Control of Flight
The main focus of this talk is to present mathematical fundamentals, state-of-the-art, technical challenges and open problems in control of flight for atmospheric vehicles, such as aircraft and other aerial platforms. Reduced order modeling and flight simulation key features for control applications will be discussed. The emphasis is on the theoretical and engineering aspects of creating and transitioning to practice guidance and flight control systems with guarantees of closed-loop stability, robustness and performance.
comment: Charts
State-Space Modelling and Analysis
Control science is a core representative of the third industrial revolution and is so important to modern civilization. Control systems are the main subject of control science and may involve many aspects of consideration, such as hardware consideration, software consideration, operation consideration, maintenance consideration, economy consideration, society consideration. However, besides all such aspects of consideration, one aspect that is most essential to the control system is methodology consideration in mathematical sense, knowledge on which is what we refer to as control theory. Besides its importance from the mathematical perspective, control theory is even more charming as it is deeply rooted in practical applications. Charms of control theory consist in both know-why and know-how and it is the fusion of control theory and practical applications that highlights such charms. Control theory for practical applications, especially when somewhat with so-called ``advanced'' flavour, involves several fundamental aspects. This article introduces the State-Space Modelling and Analysis aspect of Advanced Control Theory for Practical Applications [1,2].
Recursive Identification of EIV-ARX Models for Time Varying SISO Processes
This paper proposes a recursive algorithm, rARX-DIPCA, for identifying errors-in-variables autoregressive models with exogenous input (EIV-ARX), for tracking time-varying SISO processes. Building on a recently developed recursive iterative PCA method, the proposed algorithm recursively updates model parameters and noise variances as new measurements arrive, without storing historical data beyond a specified lag window. The method enables real-time adaptation to sensor degradation, and changes in model coefficients. The algorithm simultaneously identifies process order, time delay, and noise variances while maintaining computational efficiency through online covariance updates. Simulation studies on benchmark systems demonstrate effective tracking performance and practical applicability.
comment: Accepted for publication in the 23rd IFAC World Congress (IFAC WC 2026), Busan, Republic of Korea
Handling Control System Optimality
Control science is a core representative of the third industrial revolution and is so important to modern civilization. Control systems are the main subject of control science and may involve many aspects of consideration, such as hardware consideration, software consideration, operation consideration, maintenance consideration, economy consideration, society consideration. However, besides all such aspects of consideration, one aspect that is most essential to the control system is methodology consideration in mathematical sense, knowledge on which is what we refer to as control theory. Besides its importance from the mathematical perspective, control theory is even more charming as it is deeply rooted in practical applications. Charms of control theory consist in both know-why and know-how and it is the fusion of control theory and practical applications that highlights such charms. Control theory for practical applications, especially when somewhat with so-called ``advanced'' flavour, involves several fundamental aspects. This article introduces the Handling Control System Optimality aspect of Advanced Control Theory for Practical Applications [1,2].
Hashprice modulates the electricity demand response of Bitcoin miners
Large, fast-controllable loads such as Bitcoin mining facilities are increasingly viewed as potential sources of flexibility in modern power systems, yet the conditions under which this flexibility is realized remain incompletely understood. Using the Texas power market as an empirical setting, we examine how Bitcoin-mining load responds to two distinct electricity-sector cost channels: contemporaneous wholesale electricity prices and incentives created by coincident-peak-based transmission charges. We find that mining load responds to both cost channels in a manner consistent with miners operating around a breakeven point. At the aggregate level, we observe that mining load decreases as electricity-sector costs rise, but the strength of this response depends on hashprice, a measure of expected mining revenue from the crypto-financial sector. When hashprice is higher, aggregate load responsiveness is weaker. This mechanism is especially evident in the wholesale-price response. Mining load remains largely online at low prices and begins to decline only when electricity costs become large relative to expected mining revenue, with higher hashprice shifting the implied curtailment threshold toward higher wholesale prices. These findings indicate that Bitcoin-mining demand response to electricity-sector costs is economically state-dependent and shaped by revenue conditions in the crypto-financial sector. Treating such loads as stable demand-response resources may therefore overstate available grid flexibility, with implications for power-system planning, market design, and reliability assessment.
comment: This manuscript has supplementary information in the accompanying PDF
Edge-Based QoS-Aware Adaptive Task Placement: A Closed-Loop Control in Multi-Robot Systems
Multi-robot systems (MRS) increasingly offload compute-intensive perception tasks to edge nodes to meet strict time-sensitive Quality-of-Service (QoS) constraints. However, static task orchestration on a shared edge node can severely degrade QoS due to network latency, jitter, and edge-resource contention. We present a pilot edge-centric MRS testbed using Raspberry Pi nodes to evaluate a camera-to-manipulator pipeline under three modes: local execution, static offloading, and a QoS-aware Adaptive Task Placement (ATP) controller. ATP scores candidate placements using a multi-metric cost (normalized latency, CPU utilization, and switching overhead) over two-second control windows. The closed-loop visual servoing testbed is instrumented with sub-millisecond clock synchronization, network emulation, and detailed monitoring of multiple metrics across nodes to capture realistic jitter. Experimental results under compute-stress and network-fault scenarios show that static edge offloading reduces on-board CPU load but amplifies tail latency and deadline misses. In contrast, the QoS-aware ATP controller, by switching task placement based on measured latency and utilization thresholds, consistently lowers deadline violations and tail latency. Overall, the results position ATP as a practical edge-side control primitive for MRS and concrete design guidelines for Cloud-Edge Robotics deployments within the broader cloud-fog automation, while motivating QoS-aware multi-objective workload orchestration for industrial cyber-physical systems.
comment: 6 pages, 2 figure, 1 algorithm, accepted as a regular paper on the 24th IEEE International Conference on Industrial Informatics (INDIN), 26-29 July, 2026, Melbourne, Australia
Wire-Level Interrupt-to-Decision Latency of On-Sensor MLC versus Host Inference on the NVIDIA Jetson Orin Nano: A Pre-Registered Measurement Study
The Machine Learning Core (MLC) embedded in the STMicroelectronics LSM6DSOX IMU is widely cited as a low-latency alternative to host-side inference, yet wire-level decision-delivery latency is rarely measured. Using a Saleae Logic Pro 8 logic analyzer on an NVIDIA Jetson Orin Nano, we measured interrupt-to-decision latency (sensor INT1 edge to host decision GPIO) for three pipelines (a host-side decision-tree classifier, the standard MLC bank-switch read protocol, and an MLC binary-fast variant) under idle, I2C bus contention, and CPU stress. The protocol was pre-registered with 12 externally-timestamped Zenodo amendments before confirmatory data collection (4,770 of 4,860 trials included, 98.15%, across nine cells). The host pipeline exhibits lower median latency than the MLC pipeline under all conditions: 321.7 vs 681.5 us at idle (2.1x faster) and 574.5 vs 1,325.4 us under I2C contention (2.3x faster). The three-transaction I2C read protocol, not the silicon's classification, is the dominant latency contributor. We additionally characterize a reproducible 706.5 ms MLC decision cadence that bounds full stimulus-to-decision latency. Code, data, and pre-registration: github.com/akulswami/sensor-mlc-latency.
comment: 4 pages, 1 figure, 1 table. Submitted to IEEE Sensors Letters
PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation
Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a passivity-shielded compliance prior that recasts the VLA interface. Rather than trusting VLAs with direct motor commands, PaCo-VLA treats network outputs as task-level compliance proposals: semantic bindings, task stages, and admittance schedules. A high-frequency, proposal-independent passivity shield governs these proposals through energy-tank accounting and boundary checks, preventing invalid, stale, or unverified model predictions from bypassing low-level contact physics. This decoupled architecture also enables causal evaluation, isolating semantic contributions from geometric shortcuts. Extensive simulated and real-world connector-insertion experiments demonstrate that PaCo-VLA achieves superior precision over unshielded VLA baselines, sustaining zero passivity violations even under adversarial compliance shifts. This framework establishes a provably sampled-passive runtime contract at the admittance port and provides a runtime interface for deploying foundation models in contact-rich domains.
comment: Under review, code will be available soon
Like Uber or Like Buses? Economic Feasibility Analysis of UAM for Airport Access
The airport access use case is a promising early-stage application for Urban Air Mobility (UAM). Understanding the operational paradigm of UAM at airports is crucial for making equitable and effective regulatory and management decisions. A central open question is whether UAM will be integrated into the airport transportation network as a conventional scheduled transit service, such as subways and rail, or as a Transportation Network Company (TNC) characterized by dynamic supply-demand matching. In this paper, we propose a two-stage framework for conducting an economic feasibility analysis of UAM networks. In the first stage, we introduce a joint-supply-demand variable pricing problem to evaluate the impact of dynamic pricing on UAM operations. This model uses a binary logit formulation to capture the trade-off between travel time advantages and fare levels. In the second stage, the determined demand is used as input for the Electric Urban Air Mobility Vehicle Routing Problem with Non-linear Charging Time (eUAMVRP-NL), which optimizes fleet scheduling and charging decisions to derive operating revenue and cost estimates. We apply this framework to a case study of the Los Angeles International Airport (LAX) access market with an eight-spoke vertiport network. Our results indicate that UAM operations benefit significantly from TNC-like management; a variable pricing policy can increase operating profits by more than 100\% compared to fixed-pricing schemes. Furthermore, we identify economies of stage length in longer UAM flights.
Traffic Characterization of Event-Triggered Control Systems: A Geometric-Algebraic Perspective
This paper characterizes the triggering behaviors of event-triggered control systems from a geometric-algebraic perspective. We first model the feasibility of inter-event time transition relations as a nonconvex quadratic constraint satisfaction problem and reformulate it as an equivalent linear cone problem, which provides a clearer geometric description of the feasible region, making subsequent analysis more reliable. Building on this formulation, we establish necessary and sufficient conditions that rigorously determine whether a given transition relation is feasible. Based on this condition, we propose an algorithm that computes the set of all feasible transition relations. Numerical simulations further demonstrate how the feasibility of specific transitions evolves with the control parameter σ, with visualizations of the feasible state space offering intuitive insight into parameter selection and system design.
comment: 6 pages, 5 figures. Accepted by the 2026 American Control Conference (ACC 2026)
Stochastic Analysis of Cybersecurity Defense Strategies Under Single Attack Scenario
This research presents a novel stochastic framework for proactive cybersecurity defense timing under a single attack scenario. The approach models the defense process as a continuous observation mechanism in which the defense instant and the subsequent observation slot follow independent exponential distributions. Laplace-Carson transforms combined with first-excess theory yield the joint detection function that brackets the attack moment. Marginalization under Markovian Poisson arrivals then produces the probability density of the defense moment and conditional expectations of pre-attack and post-attack observation times. These closed-form results enable quantitative assessment of defense timing sensitivity to threat intensity and support precise calibration of observation parameters for low-latency proactive measures. Major contributions include the explicit derivation of marginal distributions and expected values, visualization of defense moment density, and the bridging of stochastic duel methodology with practical cybersecurity applications.
comment: Target to submit an international journal
Adaptive PD Gains for Energy-Conscious Control in Physical Human-Robot Interaction
Compliant force or torque control are approaches often investigated to achieve safe physical human-robot interaction (pHRI). However, these approaches have limitations. Force control requires a robot to be equipped with external force sensors to track the amplitude and direction of applied forces. Torque control requires torque sensing or estimation in each joint. As this is not available on every robot, energy-based approaches offer a promising alternative. Such approaches aim to achieve safe pHRI by limiting the mechanical energy of the robot. Current schemes leveraging an energy-based approach tend to have a complex implementation, and some may require further stability verification. We hence propose an adaptive proportional-derivative (PD) controller that can limit a robot's energy under any given limit to achieve safe pHRI. The proposed controller can limit both the kinetic and potential energy of a robot, and the behaviour of the controller gains can be shaped using various parameters, defining precisely the cutoff limit and sharpness. We construct a stability proof for the controller and define a condition to ensure the controller's stability. The proposed controller's behaviour and compliance are tested on the TALOS robot from PAL Robotics both in simulation and on hardware, verifying the expected compliant and energy-limiting behaviour of the controller.
State-Space Neural Network with Ordered Variance for Model Order Determination
This paper addresses the problem of identifying a nonlinear state-space model, along with an adequate model order, from a given input-output training dataset. To this end, a novel framework, termed state-space neural network with ordered variance (SSNNO), is proposed. In SSNNO, the state variables are ordered according to their variances computed using the training data. This ordering is achieved by introducing a variance-regularization term into the loss function used for SSNNO training and it facilitates a distinction between significant states, which exhibit high variance from the other residual states with near-zero variance. The number of significant states is indicative of a suitable model order. The variance-regularization mechanism is designed to minimize the number of significant state variables, thereby promoting a minimal order of the identified state-space model without significantly compromising its prediction accuracy. A systematic procedure is then introduced to obtain a reduced-order state-space model from the trained SSNNO, yielding a reduced-order SSNNO (R-SSNNO). The existence of an SSNNO with variance-ordered states, based solely on input-output data, as well as an upper bound on its output prediction error, are formally established. A practical and robust method is proposed for ensuring variance-ordered states in an SSNNO, even when the network is trained using local optimization algorithms. The effectiveness of the proposed method for identification of nonlinear state space models is demonstrated through simulation studies on a nonlinear continuous stirred-tank reactor process. The identified model is further used for state estimation and prediction in a model predictive control implementation.
comment: 11 pages, 3 figures
Discovering Nonlinear Static Relationships in Unlabeled Dataset using Autoencoder with Ordered Variance
This paper presents an autoencoder with ordered variance (AEO), in which the conventional reconstruction loss is augmented by a variance-based regularization term that promotes an ordered structure within the latent space. In this structure, the latent variables are ordered by their variance computed over the training data, facilitating systematic determination of the latent space dimensionality. The AEO is further extended using residual networks, resulting in a ResNet-based AEO (RAEO). Both AEO and RAEO green lead to discovery of nonlinear relationships among variables in unlabeled datasets, thereby enabling unsupervised static model extraction. Theoretical contributions include formal guarantees on the ordering of latent variances. The practical utility of the framework is demonstrated through its application to the identification of nonlinear steady-state models and their use in real-time optimization, with a continuous stirred tank reactor process serving as a representative case study.
comment: 14 pages, 5 figures
Experimental Demonstration of a Decentralized Electromagnetic Formation Flying Control Using Alternating Magnetic Field Forces
Electromagnetic formation flying (EMFF) is challenging due to the complex coupling between the electromagnetic fields generated by each satellite in the formation. To address this challenge, this article uses alternating magnetic field forces (AMFF) to decouple the electromagnetic forces between each pair of satellites. The key idea of AMFF is that a pair of alternating (e.g., sinusoidal) magnetic moments results in a nonzero time-averaged interaction force if and only if those alternating magnetic moments have the same frequency. Hence, the approach in this article is to drive each satellite's electromagnetic actuation system with a sum of sinusoids, where each frequency is common to only a pair of satellites. Then, the amplitudes of each sinusoid are modulated (i.e., controlled) to achieve the desired forces between each pair of satellites. The main contribution of this article is an experimental demonstration of 3-satellite decentralized closed-loop EMFF using AMFF. To the authors' knowledge, this is the first demonstration of AMFF with at least 3 satellites in open or closed loop. This is noteworthy because the coupling challenges of EMFF are only present with more than 2 satellites, and thus, a formation of at least 3 is necessary to evaluate the effectiveness of AMFF. The experiments are conducted on a ground-based testbed consisting of 3 electromagnetically actuated satellites on linear air tracks. The closed-loop experiments demonstrate decentralized EMFF with AMFF where the maximum steady-state formation error is less than $\pm $0.01 m and the settling time is less than 30 s. These experiments validate the decoupling of intersatellite forces through frequency-multiplexed AMFF. The closed-loop experimental results are compared with the behavior of numerical simulations.
comment: Preprint accepted for publication in Aerospace Science and Technology (Elsevier)
Scalar-Measurement Attitude Estimation on $\mathbf{SO}(3)$ with Bias Compensation ICRA 2026
Attitude estimation methods typically rely on full vector measurements from inertial sensors such as accelerometers and magnetometers. This paper shows that reliable estimation can also be achieved using only scalar measurements, which naturally arise either as components of vector readings or as independent constraints from other sensing modalities. We propose nonlinear deterministic observers on $\mathbf{SO}(3)$ that incorporate gyroscope bias compensation and guarantee uniform local exponential stability under suitable observability conditions. A key feature of the framework is its robustness to partial sensing: accurate estimation is maintained even when only a subset of vector components is available. Experimental validation on the BROAD dataset confirms consistent performance across progressively reduced measurement configurations, with estimation errors remaining small even under severe information loss. To the best of our knowledge, this is the first work to establish fundamental observability results showing that two scalar measurements under suitable excitation suffice for attitude estimation, and that three are enough in the static case. These results position scalar-measurement-based observers as a practical and reliable alternative to conventional vector-based approaches.
comment: 9 pages, 4 figures. Accepted to ICRA 2026
Situation-Aware Interactive MPC Switching for Autonomous Driving
Autonomous driving in interactive traffic scenarios remains challenging because of the mutual influence among vehicles and the inherent uncertainty of surrounding agents. Several model predictive control (MPC) formulations have been proposed to address this challenge, each adopting a different model of inter-agent interaction. While higher-fidelity interaction models enable more intelligent behavior, they incur substantially greater computational cost. Since strong interactions arise only occasionally in real traffic, a practical strategy for balancing performance and computational overhead is to invoke an appropriate controller based on situational demands. To this end, we first conduct a comparative study to assess and hierarchize the interactive capabilities of different MPC formulations. Building on this hierarchy, we then develop a neural network-based classifier for situation-aware switching among these controllers. We demonstrate that, by invoking the most advanced interactive MPC only in rare but critical situations and relying on a basic MPC in the majority of situations, situation-aware switching substantially improves overall performance while significantly reducing computational load.
Approximations and Learning for Continuous State and Action MDPs under Average Cost Criteria
In this paper, for Markov Decision Processes (MDPs) with standard Borel spaces, (i) we first provide a discretization based approximation method for MDPs with continuous spaces under average cost criteria, and provide error bounds for approximations when the dynamics are only weakly continuous (for asymptotic convergence of errors as the grid sizes vanish) or Wasserstein continuous (with a rate in approximation as the grid sizes vanish) under certain ergodicity assumptions. In particular, we relax the total variation condition given in prior work to weak continuity or Wasserstein continuity. (ii) We provide synchronous and asynchronous (quantized) Q-learning algorithms for continuous spaces via quantization (where the quantized state is taken to be the actual state in corresponding Q-learning algorithms presented in the paper), and establish their convergence. (iii) We finally show that the convergence is to the optimal Q values of a finite approximate model constructed via quantization, which implies near optimality of the arrived solution.
Accurate Small-Signal Modeling of Digitally Controlled Buck Converters with ADC-PWM Synchronization
Digital control has become increasingly widespread in modern power electronic converters. When acquiring feedback signals such as the inductor current, synchronizing the analog-to-digital converter (ADC) with the digital pulse-width modulator (DPWM) is commonly employed to accurately track their steady-state average. However, the small-signal implications of such synchronization have not been investigated. This paper presents an exact small-signal model for digitally controlled buck converters operating in forced continuous-conduction mode (FCCM) under constant-frequency current-mode control, explicitly accounting for DPWM-ADC synchronization. Using a sampled-data framework, the proposed model captures all sideband effects introduced by the sampling process, yielding precise predictions of both analog and digital loop gains, even at frequencies beyond the switching and sampling frequencies. Both asymmetrical and symmetrical carrier modulations are considered. Furthermore, the digital loop gain is derived in closed form using the modified z-transform, enabling low-complexity compensator design and stability assessment. Within this framework, the analog loop gain can be directly obtained from the digital loop gain, thereby eliminating the need for computationally intensive infinite series evaluations. The validity of the proposed model is confirmed through both simulation and experimental results.
Proactive-reactive detection and mitigation of intermittent faults in robot swarms
Intermittent faults are transient errors that sporadically appear and disappear. Although intermittent faults pose substantial challenges to reliability and coordination, existing studies of fault tolerance in robot swarms focus instead on permanent faults. One reason for this is that intermittent faults are prohibitively difficult to detect in the fully self-organized ad-hoc networks typical of robot swarms, as their network topologies are transient and often unpredictable. However, in the recently introduced self-organizing nervous systems (SoNS) approach, robot swarms are able to self-organize persistent network structures for the first time, easing the problem of detecting intermittent faults. To address intermittent faults in robot swarms that have persistent networks, we propose a novel proactive-reactive strategy to detection and mitigation, based on self-organized backup layers and distributed consensus in a multiplex network. Proactively, the robots self-organize dynamic backup paths before faults occur, adapting to changes in the primary network topology and the robots' relative positions. Reactively, robots use one-shot likelihood ratio tests to compare information received along different paths in the multiplex network, enabling early fault detection. Upon detection, communication is temporarily rerouted in a self-organized way, until the detected fault resolves. We validate the approach in representative scenarios of faulty positional data occurring during formation control, demonstrating that intermittent faults are prevented from disrupting convergence to desired formations, with high fault detection accuracy and low rates of false positives.
Optimal transmission expansion modestly reduces decarbonization costs of U.S. electricity
Major government studies and policy reports project that substantial expansion of interregional transmission will be needed to integrate clean energy and ensure reliability in decarbonized power systems. Using the open-source Switch capacity expansion model with detailed representation of existing U.S. generation and transmission infrastructure, solar, wind, and storage resources, and hourly operations, we evaluate the role of interregional transmission across least-cost, carbon-priced, and zero-emissions scenarios for 2050. An optimal nationwide plan would more than triple interregional transmission capacity, yet this reduces the cost of a zero emissions system by only 7% relative to relying on existing interregional transmission, as storage, solar and wind siting, and nuclear generation serve as close substitutes. Regional cost and rent effects vary, with transmission generally favoring wind and hydrogen resources over solar and batteries. Sensitivity analysis shows diminishing returns: one-fifth of the benefits of full expansion can be achieved with one-twelfth of the added capacity, while cost reductions for batteries and hydrogen provide comparable or greater system savings than interregional transmission. Upgrading existing interregional corridors with advanced conductors roughly doubling capacity per link at half the cost of new builds reduces system costs by only 1.6%, suggesting that reconductoring benefits are modest and that realizing their full potential likely requires pairing with new connections on key corridors or complementary reductions in battery costs. These results suggest that while substantial transmission expansion is economically justified, a diverse set of flexibility resources can substitute for large-scale grid build out, and the relative value of transmission is highly contingent on technological and cost developments.
comment: 30 pages, 10 figures, 1 table in main paper. Additional 11 pages including 7 additional figures and one table in the appendix
GNN-Enabled Robust Hybrid Beamforming with Score-Based CSI Generation and Denoising
Accurate Channel State Information (CSI) is critical for Hybrid Beamforming (HBF) tasks. However, obtaining high-resolution CSI remains challenging in practical wireless communication systems. To address this issue, we propose to utilize Graph Neural Networks (GNNs) and score-based generative models to enable robust HBF under imperfect CSI conditions. Firstly, we develop the Hybrid Message Graph Attention Network (HMGAT) which updates both node and edge features through node-level and edge-level message passing. Secondly, we design a Bidirectional Encoder Representations from Transformers (BERT)-based Noise Conditional Score Network (NCSN) to learn the distribution of high-resolution CSI, facilitating CSI generation and data augmentation to further improve HMGAT's performance. Finally, we present a Denoising Score Network (DSN) framework and its instantiation, termed DeBERT, which can denoise imperfect CSI under arbitrary channel error levels, thereby facilitating robust HBF. Experiments on DeepMIMO urban datasets demonstrate the proposed models' superior generalization, scalability, and robustness across various HBF tasks with perfect and imperfect CSI.
Communication-Induced Bifurcation and Collective Dynamics in Power Packet Networks: A Thermodynamic Approach to Information-Constrained Energy Grids
This paper investigates the nonlinear dynamics and phase transitions in power packet network connected with routers, conceptualized as macroscopic information-ratchets. In the emerging paradigm of cyber-physical energy systems, the interplay between stochastic energy fluctuations and the thermodynamic cost of control information defines fundamental operational limits. We first formulate the dynamics of a single router using a Langevin framework, incorporating an exponential cost function for information acquisition. Our analysis reveals a discontinuous (first-order) phase transition, where the system adopts a strategic abandon of regulation as noise intensity exceeds a critical threshold $D_c$. This transition represents a fundamental information-barrier inherent to autonomous energy management. Here, we extend this model to network configurations, where multiple routers are linked through diffusive coupling, sharing energy between them. We demonstrate that the network topology and coupling strength significantly extend the bifurcation points, with collective resilient behaviors against local fluctuations. These results provide a rigorous mathematical basis for the design of future complex communication-energy network, suggesting that the stability of proposed systems is governed by the synergistic balance between physical energy flow and the thermodynamics of information exchange. It will serve to design future complex communication-energy networks, including internal energy management for autonomous robots.
comment: 8 pages, 6 figures
Provably Safe Motion Planning Under Unknown Disturbances
We present a provably safe sampling-based motion planning algorithm for robotic systems affected by random disturbances of unknown distribution. We consider systems with linear or linearizable dynamics evolving in workspace with arbitrary-shaped obstacles subject to state and control constraints. Safety requirements are formulated as chance-constraints. Our approach leverages data from trajectories of the system to learn a Wasserstein ambiguity tube, i.e., a sequence of ambiguity sets, which contains the trajectory of the system's state distribution with high confidence. This ambiguity tube is then used in a probabilistically complete algorithm to grow a sampling-based motion planning tree that respects the constraints of the problem. We show that learning several lower-dimensional ambiguity tubes instead of a single high-dimensional one effectively reduces the conservatism and boosts scalability. Additionally, we design an efficient bandit-based validity checker that remarkably increases the empirical performance of our approach without sacrificing probabilistic completeness. Case studies show our algorithm finds valid plans in cluttered environments under strict safety thresholds, outperforming state-of-the-art methods.
Synthesizing Neural Network Controllers with Closed-Loop Dissipativity Guarantees
This paper presents a method to synthesize neural network controllers to maximize reward subject to the hard constraint that the feedback system of plant and controller be dissipative, certifying requirements such as stability and $L_2$ gain bounds. It considers nonlinear and uncertain plants, modeled as the interconnection of a linear time-invariant (LTI) system and an uncertainty block, which incorporates nonlinearities. The uncertainty of the plant and the activation functions of the neural network are both described using integral quadratic constraints (IQCs). First, a dissipativity condition is derived for uncertain LTI systems. Second, this condition is used to construct a linear matrix inequality (LMI) which can be used to synthesize neural network controllers. Finally, this convex condition is used in a projection-based training method to synthesize neural network controllers with dissipativity guarantees. Numerical examples on an inverted pendulum and a flexible rod on a cart are provided to demonstrate the effectiveness of this approach.
comment: Accepted to the journal Automatica, 17 pages, 9 figures
Robotics
Learning Controlled Separation of Small Objects Between Two Fingers with a Tactile Skin
We introduce and solve the novel task of controlled separation of small objects with two fingers of a multi-purpose robotic hand: after grasping into a box of small objects, the task is to drop as many of them until a desired number remains between the fingers. The objects are small compared to the width of the fingers but also in absolute terms. In our case little pellets with a diameter of only 6mm are handled. We show that the task can be performed purely tactile (no vision) using a spatially-resolved tactile skin on a fingertip. The separation policy is trained in simulation via reinforcement learning using a straightforward sparse reward, which basically checks if the desired number of objects is reached. In simulation experiments, we provide an exhaustive analysis of the benefits of using spatially-resolved tactile feedback: while an ideal (high-resolution) tactile sensor allows solving the task almost perfectly, a sensor with lower spatial resolution (here 4x4 taxels) still leads to an improvement of up to 20% compared to using only the fingers' joint sensors. For this analysis, we further train an estimator alongside the policy that predicts the ground truth contact positions. Finally, we demonstrate the successful sim-to-real transfer for the DLR-Hand II equipped with a tactile skin.
Batched Differentiable Rigid Body Dynamics in PyTorch for GPU-Accelerated Robot Learning
As robot control shifts toward large-scale reinforcement learning with in-loop dynamics computation, the community's reliance on CPU-bound libraries such as Pinocchio creates a throughput bottleneck in GPU-based training pipelines. We present BARD (Batched Articulated Rigid-body Dynamics), a self-contained PyTorch implementation of Featherstone's rigid-body dynamics algorithms, optimized for batched GPU evaluation and automatic differentiation. Three design choices make this efficient: a tiered lazy-evaluation cache that avoids redundant tree traversals, matmul-free joint transforms via pre-computed Rodrigues constants, and level-parallel propagation that reduces sequential operations to tree-depth batched steps. On five robot models (7-23 DOFs), BARD matches Pinocchio numerically while reaching up to 64x higher throughput for Forward Kinematics and 63x for Jacobians at batch size 4096 on an NVIDIA H200. We validate differentiability through gradient-based system identification on a 7-DOF manipulator, recovering link masses to 1.24% mean error under 5% torque noise, and integrate BARD into an Isaac Lab AMP training pipeline for an 11-DOF spined quadruped with 4096 parallel environments, where it is 8.5x faster than Pinocchio and 2.0x faster than ADAM for in-loop dynamics. BARD is open-sourced at: https://github.com/YueWang996/bard-pytorch-dynamics.
IDOL: Inverse-Dynamics-Guided Future Prediction for End-to-End Autonomous Driving
End-to-end autonomous driving has emerged as a compelling paradigm for learning planning directly from sensor observations, while recent world-model-based approaches further enrich this paradigm by enabling explicit reasoning about how the scene may evolve in the future. Yet future prediction alone does not guarantee better planning unless the predicted evolution can be converted into planning-relevant trajectory updates. Many current methods still forecast future scene states without explicitly decoding the motion implications hidden in state transitions. As a result, future reasoning often remains descriptively useful but only weakly coupled to executable motion generation. To address this limitation, we propose \mathbf{IDOL}, an inverse-dynamics-guided future prediction framework for world-model-based end-to-end planning in latent BEV space, where inverse dynamics serves as the key bridge between future prediction and trajectory optimization. IDOL first predicts multiple future latent scene states with a BEV world model, then applies an inverse dynamics model to adjacent latent futures to decode transition-aware trajectory features and recover planning-relevant motion deltas that explain how the latent world evolves over time. These inverse-dynamics-derived signals are used to optimize the planned trajectory, turning future forecasting from passive scene anticipation into actionable planning guidance. A lightweight closed-loop refinement module further improves long-horizon consistency by reusing the optimized trajectory for another round of future-aware reasoning. By introducing inverse dynamics into latent future reasoning, IDOL tightens the coupling between world modeling and planning. Extensive experiments on the NAVSIM v1 and NAVSIM v2 benchmarks show that IDOL achieves state-of-the-art performance among comparable methods.
comment: 20 pages, 5 figures
On-Device Robotic Planning: Eliminating Inference Redundancy for Efficient Decision-Making
Reasoning-based robotic policies using large language and vision-language models achieve strong semantic planning capabilities but mostly suffer from a high inference latency that limits practical real-time deployment. In this work, we observe that robotic reasoning workloads contain substantial temporal redundancy, where consecutive observations frequently produce identical actions and subgoals. Based on this insight, we present REIS, a human cognition inspired robotic decision-making framework that minimizes unnecessary reasoning while preserving semantic adaptability. REIS combines lightweight scene gating, KV-steered affordance routing, and deliberative reasoning to accelerate robotic control under embodied constraints. Experiments on ALFRED, and real-world robotic tasks demonstrate that REIS significantly suppresses reasoning overhead while maintaining competitive task performance.
comment: 19 pages
Actuator-Aware Inverse Kinematics with Joint-Limit Admissibility for Torque-Controlled Redundant Robots
This paper proposes actuator-aware inverse kinematics for torque-controlled redundant robots under joint-limit constraints. In the considered architecture, the inverse-kinematic output is not merely a purely kinematic joint-velocity command; it is the required joint velocity supplied to a downstream torque-level controller. Therefore, a small commanded task residual may not necessarily improve realized motion. The proposed method formulates a convex quadratic programming problem whose decision variable is the joint-level required velocity. Control barrier function style bounds impose reference-level joint-limit admissibility, while the task equation is handled through a penalized slack variable. Redundancy is resolved using a controller-compatibility objective that accounts for previous-command consistency and actuator torque-capacity weighting. The method is independent of the particular torque-level controller and can serve as an intermediate IK layer between an endpoint trajectory and a redundant robot controller. Experiments on a virtual-decomposition-controlled seven-degree-of-freedom upper-limb exoskeleton compare the method with standard inverse-kinematic baselines and a constrained task-preserving quadratic programming baseline. The results indicate lower limit-pushing commands, bounded admissible required velocities, and improved realized task behavior in the tested trajectory, without modifying the downstream controller.
Shaft-integrated Force Sensing with Transformer-based Dynamics Compensation for Telesurgery
Robot-Assisted Minimally Invasive Surgery (RAMIS) enhances surgeon dexterity, with newer platforms leveraging haptic feedback to further improve performance. Such force information has broader potential to inform performance assessment, tactile localization, and surgical autonomy. This motivates the need for accessible approaches to integrating force sensing into RAMIS tools. This work presents a method for integrating a six-axis commercial force sensor into the distal end of a standard cable-driven surgical instrument, enabling end-effector force measurement while preserving the original mechanical functionality of the device. The proposed design emphasizes reproducibility and accessibility for research applications, requiring no specialized manufacturing tools. A transformer neural network integrates force sensor measurements with robot state information to aid estimation of applied forces at the end-effector, compensating for internal cable forces arising from actuation. Our proposed approach achieved normalized errors below 6%, and generalized to unseen conditions better than purely proximal data-driven sensing approaches. High internal cable forces caused sensor saturation and reduced axial force observability, which can degrade performance along the tool's major axis and under higher load conditions. Given current levels of performance, the balance of system integrability and performance enables applications and research into timely topics of haptic feedback, skill assessment, and force-informed autonomy in RAMIS. Videos and code are available at https://enhanced-telerobotics.github.io/shaft force sensing.
comment: The paper was accepted by IEEE Transactions on Medical Robotics and Bionics in May 2026
Triangle Splatting SLAM
We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.
comment: 26 pages, 11 figures
Adaptive Artificial Time-Delay Control with Barrier Lyapunov Constraints for Euler-Lagrange Robots
This paper addresses the challenge of simultaneously compensating for state-dependent uncertainties and enforcing time-varying state constraints in Euler-Lagrange systems, a common requirement in robotics that remains underserved by existing control designs. A novel adaptive control framework is developed that combines an artificial time-delay-based uncertainty estimation strategy, also known as time-delay estimation, with a barrier Lyapunov function to enforce constraint-aware control design. Specifically, a state-dependent upper bound on the time-delay estimation approximation error is analytically formulated, and an adaptive law is constructed to estimate its parameters online, enabling real-time state-dependent uncertainty compensation without relying on prior model knowledge. To ensure constraint compliance, the barrier Lyapunov function-based controller enforces time-varying bounds on both position and velocity. The resulting architecture is provably stable via Lyapunov analysis. Experimental results on a five-degree-of-freedom robotic manipulator validate the framework's capability, compared with the state of the art, in maintaining strict adherence to safety-critical constraints under dynamic uncertainties.
Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely
Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models (VLMs) support robotic tasks involving visual interpretation, question answering, and instruction following, but their capabilities in collaborative dialogue tasks requiring spatial reasoning remain underexplored. We study this gap through a collaborative structure-building task that combines visual interpretation, grounding, language-guided interaction, and action generation. We develop a framework in which VLMs use dialogue to reconstruct a target structure from visual and textual inputs. We evaluate open-weight and closed VLMs across interaction settings, input modalities, and image representations. Results show that spatial reasoning over visual representations remains difficult for the evaluated VLMs. Detailed text representations of the target yield higher reconstruction success across modality conditions, while decomposed image representations improve performance. These findings reveal limits in visual spatial grounding and grounded instruction generation for collaborative VLM agents.
comment: Preprint
LiftNav: Path Planning via Semantic Lifting in TSDF-Guided Gaussian Splatting
Autonomous robots in unknown indoor environments require both reliable collision avoidance and object-level understanding. Classical representations such as TSDF support safe planning but lack semantics, while photorealistic methods like Gaussian Splatting (GS) provide rich appearance yet suffer from soft geometry, limiting precise obstacle avoidance. We present LiftNav, a hybrid navigation framework built on GSFusion's TSDF+GS dual map, augmented with a real-time pipeline of YOLO-based detection, TSDF-based 3D lifting, and B-spline trajectory optimization. This design enables flexible semantic navigation without dense 3D embeddings. We further introduce a hinge-loss-based collision penalty that improves trajectory smoothness and safety. We evaluate our approach in a simulation using the Replica dataset. Compared against a state-of-the-art radiance field baseline we show a 100% feasibility rate and shorter trajectories.
Haptic Sorter: A Unified Planning Framework for Online Shape Estimation and Real-Time Pose Inference
Robotics manipulation usually assumes that the shape and pose of the object are known to the robot prior to motion planning. However, precise geometric information is not always available in practice, and pose inference suffers from sensor uncertainties and view occlusion. In this work, we propose a unified model-based geometric framework integrating robotic haptic perception, modeling, and manipulation planning. Our novelties involve: \textit{i)} Introducing Bayesian Optimization (BO) to guide the haptic exploration for object shape inference, where superellipses are used to approximate geometric boundary; \textit{ii)} Adaptive formulation of manipulation potential encoding object geometry for quasi-static robot-object interaction; \textit{iii)} Proposing an online Ordinary Differential Equation (ODE) for real-time pose inference based on model prediction and tactile feedback. We deploy our system on a 2D robotic sorting task, and vary object geometries to validate the robustness and generalizability of our framework in both simulation and a real-world multi-arm setup.
Learning Terrain-Aware Whole-Body Control for Perceptive Legged Loco-Manipulation
Legged manipulators integrate exceptional terrain adaptability along with mobile manipulation capabilities, which make them highly promising for deployment in human-centric environments. By coordinating the control of both legs and arms, a whole-body controller can significantly expand the operational workspace of legged manipulators. However, many existing whole-body controllers primarily depend on proprioception and do not incorporate the critical exteroception required for effective terrain topology perception. This limitation can hinder their ability to adapt to varying environmental conditions and navigate complex terrains effectively. In this paper, we introduce TA-WBC, a terrain-aware whole-body control framework for legged manipulators, which features a novel RL-based unified policy tailored to whole-body loco-manipulation tasks in various terrains. Specifically, we employ a hybrid exteroception encoder to extract terrain features, providing an essential basis for the robot to proactively adapt posture and footholds. Furthermore, to facilitate stable cross-terrain loco-manipulation, we propose a novel end-effector sampling method based on the foot contact plane, decoupling manipulation target from base fluctuations. Moreover, a dual-policy distillation module is introduced to integrate expansive whole-body motion with terrain adaptability without catastrophic forgetting. The simulation and real-world experiments validate the robustness of our proposed controller, which leads to a larger reachable space, less tracking error, and reduced unexpected stumbles. This unified policy highlights the promising capabilities of legged manipulators in performing loco-manipulation tasks across complex terrains.
Surface Constraint Policy for Learning Surface-Constrained and Dynamically Feasible Robot Skills
Diffusion-based imitation learning methods have driven rapid progress in robot dexterous manipulation tasks. However, they have limitations when applied to tasks that involve complex free-form surface constraints because of their lack of explicit surface geometry constraint modeling and the dynamic feasibility issue, resulting in stochastic action generation that fails to achieve reliable surface alignment and maintain stable contact. To address these limitations, we propose a novel surface constraint policy (SCP) for generating robot actions that satisfy free-form surface constraints on the basis of human demonstrations and real-time visual observations. First, the surface geometry constraint is encoded using a two-dimensional weighted Gaussian kernel function that is derived from demonstrations. Building on the encoded surface geometry constraints, the diffusion-based policy is used to infer task-level action intentions from multimodal sensory inputs, including visual observations and robot state feedback. These intentions are further transformed into surface-constrained dynamic movement primitives (DMPs) through a similarity-based action mapping method, thereby enabling smooth and compliant motion execution. The SCP achieves generation of structured surface geometric intent and dynamically admissible actions. The proposed method is validated on multiple surface manipulation tasks and compared with existing techniques. The experimental results demonstrate superior task success rates and contact stability under surface constraints.
AR Forcing: Towards Long-Horizon Robot Navigation World Model
The diffusion based robot navigation world models are typically trained using parallel supervision, while autoregressive inference is employed during path planning. This results in a distribution shift between training and inference, which destabilizes the performance over long-horizon prediction. We propose AR Forcing, an autoregressive training strategy, which integrates the standard diffusion loss into the autoregressive training loop. At each step, the model uses its own predictions to update the context and optimize the single step noise prediction objective, thereby explicitly exposing the model to the inference state distribution during training. Our method does not require additional discriminators or distribution-matching losses, retains the original diffusion framework and sampler, and is easy to integrate. Experiments on multi-domain navigation datasets (RECON, SCAND, HuRoN, TartanDrive) show that compared with strong baselines, AR Forcing improved the consistency of generated images during long-horizon navigation and the accuracy of predicted trajectories, enhancing robustness of the model in complex known and unknown environments. We will release the code soon.
DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation
Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation~(DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.
comment: 14 pages, 2 figures
Before Parc Fermé: RL-Time Pruning for Efficient Embodied LLMs in Autonomous Driving
Embodied Large Language Models (LLMs) are increasingly used as reasoning modules in robotic control pipelines to improve human-robot interaction, but their memory and generation latency make real-time deployment difficult. Pruning can reduce these costs, but for controllers that undergo multiple pre- and post-training phases, the crucial question is not only how much to prune, but when pruning should occur. In this work, we propose Before Parc Fermé (BPF), a pruning strategy performed during RL that compresses embodied LLM controllers while they are still being optimized for closed-loop behavior. This allows pruning decisions to account for the task-specific supervision and closed-loop feedback that shape the final controller. We propose two variants: BPF-RL, which performs iterative pruning during RL by removing part of the model at predefined training intervals, and BPF-SFT/RL, which first prunes part of the model structure during SFT and then further compresses it during RL using the same iterative strategy as BPF-RL until the target pruning ratio is reached. We evaluate BPF on RobotxR1, an LLM-based autonomous-driving control pipeline, using an established LLM pruning framework (LLM-Pruner), and compare it against post-training pruning, post-training pruning with RL recovery, SFT-stage pruning, and smaller dense models from the same family. Our results show that BPF provides the best task-performance vs. memory and throughput trade-off among the considered pruning strategies. When compressing the larger RobotxR1 models, BPF-SFT/RL achieves a $1.69\times$ better size-end-to-end performance trade-off than directly selecting a smaller dense model from the same family, measured as removed parameters per lost percentage point of control adaptability. On the Jetson AGX Orin mounted on the target robotic platform, the compact models improve decode throughput by up to $27\%$.
HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model
Learning generalizable vision-language-action (VLA) models from large-scale human videos is promising but challenging due to cross-embodiment discrepancies in both visual observations and executable actions. While latent action models reduce the action execution gap by learning action abstractions, they still rely on visual features. Thus, misaligned human and robot visual representations can lead to inconsistencies in policy inputs and induce domain-dependent latent actions, hindering effective co-training with human videos. To address this, we propose HARP, a human-robot aligned representation learning framework for more effective VLA pretraining from human videos. Specifically, HARP uses limited paired human-robot demonstrations as cross-embodiment bridges and abundant unpaired human and robot videos as a scalable dynamics supervision data source. It trains a robot-adapted visual encoder and a latent action model with manipulation-centric auxiliary cues and a source-relative pair-discriminative alignment loss, which adapts robot representations toward human semantics while preserving pair-level discrimination. The learned aligned vision encoder and latent action model provide a unified vision and action representation for VLA-style policy learning, where human and robot videos provide vision-language-to-latent-action supervision and a lightweight robot action head grounds latent actions into executable commands. Experiments on feature visualization, simulation, and realworld manipulation show improved human-robot alignment and downstream policy performance, achieving 4.481 average length on CALVIN ABC$\rightarrow$D and a 7.1\% realworld success rate gain over the strongest baseline.
Simulation of collision avoidance behavior in crowd movement by data-driven approach
Crowd movement simulation is essential for pedestrian safety management and facility layout optimization. Data-driven models enhance trajectory prediction accuracy under Euclidean metrics, yet they suffer from excessively high collision rates, especially in bidirectional and multidirectional flows. In this paper, we establish a novel data-driven crowd simulation model that incorporates the pedestrian collision mechanism into the loss function to reduce collisions. A new lateral-acceleration-based collision loss function and a Voronoi-based motion feature extraction approach are proposed. The model is based on a Generative Adversarial Network (GAN) architecture and is termed CPGAN (Collision-Penalized GAN). We evaluate CPGAN in bidirectional flow scenarios, which involve frequent collision avoidance behaviors. Results show that the proposed lateral-acceleration-based collision loss significantly reduces opposite-direction pedestrian collision rates to levels comparable with controlled experiments. CPGAN effectively simulates bidirectional flow, reproducing lane formation and N-t curves. The research outcomes can provide inspiration for integrating pedestrian dynamics mechanisms into loss functions in data-driven crowd simulation.
Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration
Safe human--robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat~3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50\%, explicit depth is not automatically transformed into robot-body collision evidence, and robot--scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.
comment: 31 pages, 9 figures
TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues
Outdoor vision-language navigation (VLN) in long-range, open-world environments is frequently disrupted by semantic-cue interruptions, where informative goal cues become sparse, occluded, or leave the field of view. Once such cues disappear, agents enter a cue-free phase and often degrade into backtracking, oscillatory headings, or aimless exploration. While memory-based methods attempt to bridge these gaps, they often fail under traversability-driven detours: the remembered cue direction may be infeasible, forcing detours that prolong cue-free phases and gradually render robot-centric cues stale and implicit histories blurred. This makes traversability a stability condition for maintaining goal-directed guidance, rather than merely a local safety concern. We propose a unified outdoor VLN framework that survives semantic-cue interruptions by maintaining traversability-consistent executable guidance throughout prolonged cue-free phases. Specifically, our method extracts semantic bearings from visibility-gated goal or exploration cues and grounds them into executable headings using a real-time near-field traversability profile, providing goal-consistent feasible guidance beyond reject-only safety filtering. To prevent guidance degradation during detours, we lift intermittent 2D evidence into a world-aligned 3D cue memory with an uncertainty-aware readout mechanism, ensuring guidance remains continuously reachable and stable as the robot moves. We evaluate the framework on quadrupedal and wheeled platforms over 600--1000 m routes. Our method improves simulation success rate by over 10 percentage points over the strongest baseline and achieves a real-world success rate of 40%, compared to 17.5% for the strongest baseline, with substantially higher robustness during prolonged cue-free intervals.
Don't Fool Me Twice: Adapting to Adversity in the Wild with Experience-Driven Reasoning
In robotics, dangers and adversity modes are often embodiment-specific and relative to each agent. A frontier of autonomous mobile robotics is to enable agents to operate effectively in the wild in unseen unstructured environments. A significant challenge in unseen unstructured environments is that it may not be possible to predict all the dangers to the specific robot. Although recent work has used large foundation vision-language models (VLMs) to preemptively predict an exhaustive list of common-sense dangers, it remains difficult to capture possible interaction and embodiment-dependent adversities. We propose a continual learning framework for a mobile embodied agent to learn online from disturbances and attribute anomalous behaviours to causes through semantics, enabling better prediction and planning of the world in the future. Our framework, "Don't Fool Me Twice", first observes disturbances and describes their effects on the robot; this description is augmented with visual context to query a VLM to predict possible causes; the local disturbance is characterized using kernel regression, which allows for efficient, few-shot modeling of transient anomalies. We leverage semantic voxel-centric modeling to estimate epistemic uncertainty, enabling richer downstream recovery by treating interaction-driven disturbances as learnable spatial behaviors. We present four hypotheses and validate them in simulation and on hardware across embodiments and adversity modes.
NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving
Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.
Building Generalization Into Behavior Generation Via Adaptive Compositions of Regularities
Generalization in robotics requires prior knowledge about how the world is structured, yet this structure changes from one situation to the next. This paper investigates the proposition that generalization arises from adaptively composing regularities -- predictable relationships within the robot-environment system -- into situation-appropriate structures for behavior generation. We examine this proposition by analyzing the mechanism in AICON (Active InterCONnect), a framework representing regularities as interacting processes in a differentiable network, where sensory feedback realizes composition and gradient descent generates behavior. To isolate adaptive composition as the key mechanism, we study a simple simulated problem in which all relevant regularities can be identified. We expose the resulting model to a wide range of novel conditions not considered during design, and we find that it generates context-appropriate behavior in all but one case, where encoded regularities are provably insufficient. Ablations reveal that the network automatically modulates which regularities influence behavior based on their informativeness. These results suggest that adaptive composition of regularities constitutes a powerful inductive bias for building generalization into behavior generation.
comment: 10 pages, 6 figures
Seeing Fast and Slow: Bimodal 3D Scene Graphs for Open-set Tasks
Open-set task execution can significantly benefit from seamlessly switching between coarse and fine scene representations depending on the context and the evolving information as the robot explores the environment. For example, it is often sufficient to start with a coarse scene representation initially and only employ a finer, more granular scene representation when the robot encounters regions which are likely to contain the task relevant objects. Hence, in this work, we propose BiMoSG, a bimodal 3D scene graph generation approach for open-set tasks. BiMoSG employs a "fast" mode by default to efficiently generate a coarse 3D scene graph and can switch to a "slow" mode for generating a finer open vocabulary 3D scene graph of task relevant objects. We demonstrate that our proposed 3D scene graph generation approach is significantly faster than the open-source state-of-the-art approaches. This allows us to integrate the scene graph generation process with task execution for real-time deployment.
Can Aerial VLA Models Cooperate? Evaluating Closed-Loop Air-Ground Coordination with CARLA-Air
Recent aerial vision-language-action (VLA) models show promising single-UAV capabilities, such as tracking moving objects and navigating to language-specified landmarks. However, it remains unclear whether these capabilities can transfer to air-ground cooperation, where a UAV and a UGV must act jointly in a shared, closed-loop physical world. We study this question with CARLA-Air, a single-process air-ground evaluation environment that unifies CARLA and AirSim inside one Unreal Engine runtime. By sharing the same world state, physics tick, and sensing pipeline, CARLA-Air enables physically consistent UAV--UGV interaction and precise measurement of simulation-timestamp alignment and effective coordination latency. Using CARLA-Air, we evaluate representative aerial VLA and planning baselines on two complementary diagnostic tasks: moving-platform landing and occlusion-recovery escort. The results show that current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. Our code is available at https://github.com/louiszengCN/CarlaAir.
comment: Code at https://github.com/louiszengCN/CarlaAir
A study on a Real-Time VR-Based Teleoperation Framework for Manipulator in Dynamic Environment
Robot teleoperation enables safe, non-contact task execution in hazardous environments where direct human access is difficult, and its application has expanded with recent VR technologies. Many VR teleoperation studies, however, have primarily served as data-collection tools for robot imitation learning, so they often do not explicitly address dynamic obstacles, workspace changes, or collision risks during operation. For real deployment aimed at operator safety, teleoperation must react to dynamic situations with low latency and remain robust to mistakes made by inexperienced operators. This paper presents a VR teleoperation framework that supports real-time manipulation while handling collisions with both static and moving obstacles. The framework integrates GPU-accelerated inverse kinematics and trajectory optimization within a VR interface to generate feasible joint commands at each control cycle under robot constraints. Experiments with a 7-DoF manipulator demonstrate stable online behavior and collision-aware motion generation across three scenarios: obstacle-free, static-obstacle, and moving-obstacle environments. The results indicate that the proposed approach generates motion consistent with the operator's command while producing safe detours when obstacles interfere with the commanded path.
comment: This manuscript has been submitted for possible publication
RDGen: Demonstration Generation for High-Quality Robot Learning via Reinforcement Learning
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robot control. However, their performance remains fundamentally constrained by the availability of high-quality robot trajectory data. In current robot learning practice, such data are primarily collected through human teleoperation, which is labor-intensive, costly, and difficult to scale. In this paper, we propose RDGen, a sim-to-real reinforcement learning framework for generating high-quality robot demonstrations. Rather than employing reinforcement learning solely as the final control policy, RDGen leverages trained RL policies as a structured trajectory generator. The system consists of a VLM-based task parser that identifies task-relevant objects, a Grounding DINO-based object localizer, and an RL policy transferred from simulation to the real robot. Successful rollouts are then harvested as clean, high-quality demonstrations for downstream VLA training, while the simulation stage further provides a scalable source of additional trajectories at little marginal cost. Experiments on a pick-and-place task demonstrate that the transferred RL policy achieves a high task success rate. Compared with human teleoperation, RDGen produces significantly smoother trajectories and yields superior downstream VLA performance. These results indicate that RL-generated demonstrations can serve as more reliable and consistent supervisory signals for robot policy learning.
comment: 13 pages, 4 figures, 3 tables
Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization
Human-like agents are a long-standing goal of artificial intelligence. Despite strong performance, most reinforcement learning (RL) agents remain reward-driven and often exhibit behaviors that differ from humans, limiting interpretability and reliability. In this work, we introduce a novel human-like RL framework that predicts action sequences closely aligned with human behaviors while maximizing rewards. Specifically, we encode human demonstrations into macro actions using a hierarchical macro action quantization approach (termed HiMAQ) consisting of two successive levels of vector quantization. The lower quantization level maps input actions to fine-grained subaction clusters, while the higher quantization level aggregates these subaction clusters into action clusters. Extensive evaluations on the D4RL benchmarks show that our hierarchical approach outperforms the non-hierarchical baseline (MAQ), achieving better human-likeness scores while maintaining comparable or better success rates than previous RL agents. The improvements generalize across integrations with various RL algorithms, namely IQL, SAC, and RLPD.
Trajectory Planning for Non-Communicating Mobile Robots using Inverse Optimal Control
To enable an efficient interaction of non-communicating mobile robots in collision avoidance scenarios, we present a novel combined trajectory planning and prediction algorithm. Inverse optimal control is used to estimate unknown goal states of all robots based on observed past trajectories. Each robot also takes the perspective of other robots in considering self-prediction and solves a joint prediction problem using the estimated goal states. The resulting predictions are then considered for planning. Simulation results of scenarios with 2-8 robots show that the median of the durations until all vehicles reach their goals is 9.8 % faster compared to planning with constant acceleration based estimated goal states. Moreover, the proposed approach never leads to the solver being unable to find a solution to the planning or prediction problem.
Wall-OSS-0.5 Technical Report
Large-scale Vision-Language-Action (VLA) pretraining is increasingly adopted as the foundation for robot policies, yet the evidence for pretrained VLAs is almost invariably reported after task-specific fine-tuning.This leaves a foundational question unanswered: does VLA pretraining itself yield executable robot behavior, or does it merely furnish a better initialization for downstream policy learning? We present Wall-OSS-0.5, an open-source 4B VLA built upon a 3B VLM backbone augmented with action-generation components, designed so that pretrained robotic capability is directly measurable on physical hardware.The model is pretrained across more than 20 embodiments, processing over one million robot trajectories per epoch alongside a grounded multimodal corpus. We adopt a gradient-bridged co-training recipe in which three objectives play distinct and complementary roles: discrete action prediction routes strong VLM-native gradients into the backbone, multimodal prediction preserves grounded vision-language understanding, and continuous flow matching serves as the deployment-time action interface. Before task-specific fine-tuning, the pretrained checkpoint achieves non-trivial zero-shot real-robot behavior, completing several tasks, including a held-out deformable manipulation task, at high task progress on a 17-task suite. After fine-tuning, the same checkpoint serves as a stronger adaptation prior, reaching 60.5% average task progress on 15 real-robot tasks and outperforming π_0.5 by 17.5%. Multimodal evaluations further confirm that action training does not erode grounded vision-language competence: the model preserves broad vision-language ability while strengthening embodied grounding. Together, these results reposition VLA pretraining from an initialization strategy to a directly testable, already useful source of robot capability.
High-Load-Density Electro-Permanent Magnetic Foot with Controllable Adhesion for Quadruped Wall-Climbing Robots
To enable reliable climbing locomotion of quadruped robots on ferromagnetic surfaces, this paper presents a high-load-density electro-permanent magnetic foot with controllable adhesion, featuring force-feedback circular Halbach-net electro-permanent magnet (CHN-EPM) adhesion units and a magnetization control system. Due to its three-dimensional magnetic circuit structure and flux-concentration effect, the CHN-EPM enables a distributed parallel magnetic flux path with enhanced flux utilization, resulting in reduced sensitivity to air-gap variations and allowing effective adhesion to be maintained even under partial contact conditions. The proposed CHN-EPM generates a maximum adhesion force exceeding 1000 N with a load-to-weight ratio over 200:1. A magnetization driver and a two-stage pulse current control strategy are developed to regulate the excitation current amplitude and duration, enabling accurate and reliable magnetization. By incorporating a flexible pressure sensor for contact force feedback, the system can effectively monitor attachment and detachment states, ensuring robust adhesion switching under uncertain contact conditions. The proposed system is integrated into a commercial quadruped robot (Unitree GO2), demonstrating high-load adhesion on ceiling and vertical-wall surfaces and stable locomotion on painted, perforated, and curved ferromagnetic surfaces.
comment: 10 pages, 6 figures, 2 tables; project page and videos available in the repository
Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring
Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods either rely on expensive action resampling or external models, while alternatives propagate trajectory-level labels uniformly across every timestep, obscuring localized failure signals. In this paper, we propose \textbf{Hide-and-Seek}, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, $π_0$, and $π_{0.5}$.Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy--timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.
Feat2Go: Visual Feature-Grounded Value Estimation for Embodied Reinforcement Learning
Reinforcement learning is a promising approach for improving the capabilities of vision-language-action (VLA) models while avoiding the heavy data requirements of imitation learning. However, its effectiveness for VLA models is often constrained by sparse supervision and the difficulty of designing informative reward signals for long-horizon manipulation. In this work, we present Feat2Go, a fine-grained value estimation framework for embodied reinforcement learning. Specifically, Feat2Go first derives a continuous progress target from a pretrained visual world model by measuring patch-level similarity to subgoal states and partitioning episodes into semantic stages with trend-based clustering. We then train an embodied value model to predict this structural progress from the current observation and task instruction, and use the predicted value to reshape terminal rewards during policy optimization. The proposed framework is compatible with existing VLA policy reinforcement learning pipelines, including PPO and GRPO, and does not rely on manual reward engineering. Extensive experiments on ManiSkill3 and RoboTwin 2.0 demonstrate that Feat2Go consistently improves the performance of existing VLA models under both single-arm and bimanual manipulation settings. More specifically, on ManiSkill3, Feat2Go improves OpenVLAOFT from 17.5% to 82.9% average out-of-distribution success while retaining 96.9% in-distribution performance. On RoboTwin 2.0, Feat2Go achieves an average success rate of 88.8% in domain-randomized task settings, outperforming prior reinforcement learning methods.
Two Degree-of-Freedom Vibratory Transport in a Grasp
In this paper, we use asymmetric vibrations to demonstrate two degree-of-freedom (DoF) in-hand manipulation of grasped parts. The asymmetric vibrations are achieved through closed-loop position control of a moving surface, which applies a periodic stick-slip waveform to the part to be manipulated. We show analytically how two vibratory waveform parameters, the sticking acceleration and the slipping acceleration, affect average part velocity when moving against gravity. The theoretical trends are then validated using an experimental setup where the squeeze force is controlled and part motion is recorded by a high-resolution encoder. We also develop a 2-DoF vibratory surface capable of translation in one direction and rotation about the surface normal. Using two of these 2-DoF surfaces in a parallel jaw gripper configuration, we bidirectionally translate and rotate a variety of grasped parts, as well as demonstrate that the same waveform trends for translation also persist for in-plane rotation.
Object-Informed Model Predictive Path Integral Control for Non-Prehensile Robot Manipulation
Long-horizon planning for non-prehensile robot manipulation is challenging due to underactuated and discontinuous interactions. We propose a hierarchical formulation of model predictive path integral (MPPI) control that guides robot-level planning with a separately computed object-level plan to achieve efficient long-horizon prediction. We first solve a simplified object-only problem, assuming the object can be actuated directly, and use the planned object trajectory as a reference in solving the joint robot-object planning problem. We evaluate our method in both simulation and hardware using a 6-DoF xArm6 manipulator to perform object pushing tasks in which the target object must reach a goal while avoiding static obstacles, necessitating non-myopic reasoning. Our object-informed MPPI increases task success by 40\% with a 26\% faster control frequency in simulation, and by 20\% in real experiments with similar computation as regular MPPI.
DisPlace: Discriminative Place Projections for Multi-Reference Visual Place Recognition
A key challenge in Visual Place Recognition (VPR) is matching query images against reference maps captured under diverse environmental conditions and viewpoints. While multiple reference traversals improve robustness, existing fusion strategies either aggregate references uniformly or rely on heuristic selection, without distinguishing descriptor variations that preserve stable place identity from those caused by changing conditions or viewpoints. In this paper, we propose DisPlace, a multi-reference VPR framework that fuses multiple reference descriptors into a single compact and discriminative place representation. DisPlace formulates descriptor fusion as a generalized eigenvalue problem that maximizes between-place separability while suppressing within-place variation across references, rather than preserving overall descriptor variance. Unlike existing multi-reference fusion methods, DisPlace exploits variation across reference traversals to identify which linear combinations of descriptor dimensions preserve place identity and which capture condition- or viewpoint-specific variation. We evaluate DisPlace on Oxford RobotCar, Nordland, Pittsburgh30k, and Google Landmarks v2 across six state-of-the-art VPR descriptors. DisPlace outperforms seven multi-reference baselines in 49 out of 54 appearance-varying conditions, consistently improves descriptor-level fusion performance under viewpoint and unstructured settings, and requires less storage during inference than all compared fusion methods.
comment: Under review
SSR: Scaling Surefooted and Symmetric Humanoid Traversal to the Open World
Extending humanoid traversal to the open world is key to practical deployment in human environments, but remains challenging. The robot must use vision to ensure safe and reliable foot placement on heterogeneous terrain under highly dynamic motion, while producing coordinated, natural whole-body behaviors. We propose SSR, an efficient end-to-end framework for egocentric vision-based humanoid traversal that jointly learns these capabilities. SSR introduces imagined foothold guidance, which learns to model forthcoming swing-foot contacts and evaluates their support to guide pre-touchdown swings toward stable regions, reducing edge slips. It further employs equivariant latent-space symmetry augmentation to efficiently induce bilateral coordination under high-dimensional visual observations, and uses terrain-specific multi-discriminator motion priors to encourage human-like behavior across scenes. Extensive experiments show that SSR achieves safe, stable, and high-quality locomotion on diverse real-world terrains, including stairs with varied structures and extreme challenges such as wide gaps and high platforms, while enabling reliable long-horizon traversal in open outdoor environments.
FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance
Maximum entropy reinforcement learning (MaxEnt-RL) enables robust exploration, yet practical implementations often restrict policies to simple Gaussians. While recent approaches incorporate expressive generative policies via importance-weighted supervised learning, they are prone to importance weight collapse, which limits their scalability in high-dimensional action spaces. Our key insight is to mitigate this limitation by localizing the sampling region, avoiding the weight degeneracy induced by importance sampling over the entire action space. To instantiate this insight, we introduce \textbf{FLAG} (\textbf{F}low policy with \textbf{L}atent-\textbf{A}ugmented \textbf{G}uidance). FLAG augments the state space with a flow latent variable and optimizes a provably consistent proxy MaxEnt-RL objective. We empirically demonstrate that FLAG enables expressive policy optimization with limited importance samples and scales to high-dimensional control tasks. Furthermore, FLAG achieves state-of-the-art performance across challenging benchmarks. Our project webpage: https://flag-rl.github.io/
GSAM: A Generalizable and Safe Robotic Framework for Articulated Object Manipulation PPSN 2026
Articulated object manipulation is a unique challenge for service robots. Existing methods employ end-to-end policy learning, visionmotion planning, and large-language/visual-language model (LLM/VLM), but often overlook the diversity of articulated objects and the complexity of interactions between end-effector and handle, leading to limited generalization and destructive collisions. To address this, we propose GSAM, a generalizable and safe robotic framework for articulated object manipulation. Specifically, a vision-based perceiver generates the kinematic parameters. Considering that pre-trained markers in perceiver yield raw estimations that may deviate from commonsense, we present a f ine-tuned VLM-based refiner, using chain-of-thought (COT) commonsense reasoning to refine perception. To prevent destructive collisions, we design an interaction constraint function generator, integrating articulated object, interaction pose, and obstacle avoidance knowledge into a base. LLM then functionalize these constraints and apply them to trajectory and posture planning. A kinematic-aware manipulation planner verifies reachability for trajectory and posture. Experiments on 50 hinge tasks across 5 object categories and 50 randomly initialized end-effectorhandle configurations show that GSAM reduces standard deviation by 3.1% and improves manipulation success rate by 36.0% compared to the best baseline, respectively demonstrating the superior object generalization and interaction safety of GSAM in practical scenarios.
comment: Accepted by the 19th International Conference on Parallel Problem Solving from Nature (PPSN 2026)
Geometry-Aware Control Barrier Functions for Collision Avoidance via Bernstein Polynomial Approximations ICRA 2026
Safe navigation often relies on well-defined conditions based on the shape of robots and obstacles, and can be challenging when they have irregular geometries. While Control Barrier Functions (CBFs) offer an efficient mechanism to enforce safe set forward invariance, common shape surrogates (e.g., spheres or super-ellipsoids) either are overly conservative in unstructured scenes or require many local primitives, which inflates constraint counts and degrades real-time performance. In this paper, we introduce a novel geometry-aware Control Barrier Function (CBF) based on Bernstein-Polynomial Signed Distance Fields (BP-SDFs). It provides a unified way to represent the obstacles and robots, so as to represent the barrier function with a unified minimum distance. Benefiting from the differentiability of the Bernstein polynomials, one can easily enforce the control constraints in a closed loop. We validate the method's efficiency and performance to guarantee safety in single-robot navigation and heterogeneous multi-robot collision avoidance via simulations under different environments.
comment: 8 pages; Accepted by 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
Primitive Subspaces Mediate Few-Shot Transfer in VLAs
Deploying vision-language-action (VLA) policies in industrial environments requires the ability to teach new tasks at low cost, a property current VLAs lack, since each new task requires fine-tuning. We investigate whether primitive-aware training produces a transferable artifact: a learned library of sub-skills that can be composed at inference time, conditioned on a small number of demonstrations, to perform tasks the policy was never trained on. We train two VLA architectures with different inductive biases, OpenVLA and $π_{0.5}$, on the REASSEMBLE contact-rich assembly dataset under matched LoRA fine-tuning recipes and locked hyperparameters, varying training between flat trajectories and primitive-segmented episodes with primitive-specific language prompts. We hold out 6 object-task combinations from training and evaluate few-shot transfer: models receive $m \in \{0, 1, 3, 5, 10\}$ demonstrations of a held-out task and attempt execution without weight updates. We replicate across three training seeds and validate on a second dataset (LIBERO-Long). Primitive-trained models reach 78% of fine-tuned upper-bound performance with only m=3 demonstrations, while flat-trained models require m=10 demonstrations to reach the same level -- a $3\times$ sample efficiency gap that replicates across seeds, architectures, and datasets. To establish causation, we ablate the primitive-decodable subspace of hidden states and show few-shot transfer degrades by 32 percentage points while ablating a random subspace of equal dimensionality has no effect, indicating primitive representations are causally necessary rather than incidentally correlated with transfer. We identify and correct a methodological pitfall in evaluating chunked policies: family-wise inflation of single-step action-range gates produces order-of-magnitude higher false-failure rates against ground-truth human demonstrations.
WristCompass: Kinematic Coupling as a Learnable Visual Concept for Ego-Camera Orientation
Recovering ego-camera orientation from manipulation video is a prerequisite for disentangling hand motion from camera motion, a key step in imitation learning from egocentric demonstrations. The obvious approach, inferring orientation from scene geometry, fails when hands occlude the frame: VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark. We identify an alternative visual concept that is present precisely when scene geometry is absent: kinematic coupling dynamics, the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain. We find that this concept is compact (4D inter-wrist features outperform 126D full hand keypoints), temporal (requiring a GRU over short windows rather than per-frame retrieval), and physically grounded (transferring zero-shot across datasets because it is rooted in anatomy rather than scene appearance). Trained only on tabletop manipulation, WristCompass transfers zero-shot to Epic Kitchens cooking video, achieving 14.3$^\circ$ median geodesic error and approaching the performance of a 1B-parameter scene model at 200K GRU parameters.
Literary Emotions in Motion: A Soft Robotics Installation for Tactile Storytelling
Soft robotics is increasingly explored in artistic contexts, where tactile interaction provides audiences with embodied engagement beyond visual or auditory signals. This work presents an interactive installation that maps semantic emotion analysis of narrative text into variable stiffness of soft pneumatic modules. A natural language model identifies two dominant emotions from a predefined set of six, driving the inflation of seven hexagonally arranged soft actuators. The central actuator represents the primary emotion, while the surrounding ones express the secondary. We develop and mechanically characterize silicone actuators, called soft modules, featuring a thin membrane layer, demonstrating how this morphological control expands the achievable stiffness range while preserving simplicity and low-cost fabrication. A user study with ten participants further evaluates how multisensory coupling of stiffness and LEDs intensity influences emotional perception. The results suggest that stiffness modulation accompanied by color change can support emotionally meaningful and engaging tactile interaction in soft robotic installations.
comment: 8 pages, 8 figures
SoFiE: Soft Finger Exoskeleton for Intelligent Grasping
Soft wearable robotic systems have emerged as a promising solution for assisting individuals with reduced hand function. This paper presents SoFiE, a modular soft finger exoskeleton designed to assist index-finger flexion during grasping tasks. The proposed system is primarily fabricated using 3D-printed flexible materials, enabling a lightweight, low-profile, and modular design. Actuation is achieved through a tendon-driven mechanism powered by a compact DC motor, while passive extension is provided by a compliant conductive spring. This element, termed StretchSense, also functions as a proprioceptive sensor by exhibiting resistance changes under deformation. Furthermore, a novel tactile sensing approach, MagSense, is introduced, using a magnet and magnetometer pair embedded in a soft fingertip structure to estimate contact force and object compliance. The system is fully untethered and controlled by an embedded microcontroller. In addition, actuator-level sensing through motor encoder feedback enables estimation of the system state, providing a foundation for safe and adaptive control strategies. Experimental validation demonstrates the capability of the system to provide reliable pose estimation, distinguish between materials with different stiffness, and generate distinct sensor signatures across different grasping tasks. This paper details the design, fabrication, and sensing concepts of the proposed exoskeleton as a proof of concept toward modular, soft, and assistive wearable robotics.
Behavior Cloning of MPC for 3-DOF Robotic Manipulators ICRA 2026
While Model Predictive Control (MPC) provides strong stability and robustness, it imposes a significant computational burden on real-time systems. This paper investigates the application of Behavior Cloning to approximate MPC policies for the real-time control of a 3-degree-of-freedom robotic manipulator. We present a baseline controller combining Inverse Kinematics with MPC and evaluate neural network architectures, ranging from classical regression algorithms to deep learning models including Deep MLPs and RNNs, to derive computationally efficient surrogate policies. We analyze generalization capabilities, stability considerations, and the trade-offs inherent in different architectural choices. Our empirical study employs both online and offline evaluations to assess performance regarding accuracy, computational efficiency, and fidelity to the original MPC policy. Our results demonstrate that Behavior Cloning can effectively reduce the computational burden of MPC policies for 3-DOF robotic manipulators, achieving a 3x reduction in inference latency with a 84.98% success rate under relaxed tolerances. Notably, we find that static architectures outperform temporal variants, confirming the sufficiency of instantaneous state observations for this task. However, we observe a precision gap under strict tolerances, which suggest that while Behavior Cloning captures the global optimal trajectory, further research is needed to minimize terminal steady-state error.
comment: Accepted at the IEEE ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning (RL4IL), 6 pages excluding references
Constrained Whole-Body Tracking for Humanoid Robots
Recent advances in reinforcement learning (RL) have demonstrated impressive whole-body agility for humanoid robots, yet ensuring safety and satisfying constraints -- particularly those specified after training -- remains a challenge. Towards this goal, we present ConstrainedMimic, a control framework that leverages whole-body kinematics and dynamics for real-time constraint enforcement within RL tracking policies. By integrating principles from operational space control and control barrier functions (CBFs), we enable the satisfaction of arbitrary runtime constraints on both the kinematic reference motion and the underlying dynamics. In whole-body motion-tracking and teleoperation experiments on a (simulated) Unitree G1 with a learned policy, we demonstrate collision avoidance (both with the robot body and external obstacles), joint limits, and center of mass stability constraints. By remaining consistent with the current contact mode and tracking objectives, we minimally restrict the capabilities of the policy when constraints are active. Our method is fully differentiable, runs on CPU, GPU, and TPU, and can be deployed at up to 300-500 Hz. All software will be freely available upon publication.
FAIR^2 Drones: An AI-Ready Standard for Cross-Domain Wildlife Drone Datasets
Animal ecology data collection using drones represents a substantial investment of time, expertise, and financial resources. Yet most existing datasets serve only a single research community, limiting interdisciplinary reuse. We propose a unified drone dataset standard, FAIR^2 Drones, that bridges ecology, robotics, and computer vision by building on existing FAIR and AI-ready data frameworks while adding essential platform metadata and annotation specifications. Our standard enables datasets to simultaneously support ecological analysis, robotics algorithm development, and computer vision benchmarking. We provide open-source validation tools, reference implementations, and multimodal extensions linking drone imagery with complementary sensors such as camera traps, GPS, and acoustics. By standardizing metadata across disciplines, this framework maximizes the scientific return on investment for costly field deployments and accelerates cross-domain collaboration in environmental monitoring.
Belief Consistency Between Foundation-Model Evidence and Geometric Perception in Persistent Robotic Maps
Persistent maps used by autonomous robots increasingly fuse a geometric perception stack whose assertions are well-characterized with a foundation-model channel that produces semantic claims without calibrated reliability about the same scene. Contemporary mapping systems integrate the two channels by treating the foundation-model channel as an additional voter into a per-element posterior, uncalibrated for its own per-class reliability and without machinery to flag when the two channels contradict each other at a given moment. We propose an update operator with two cooperating mechanisms: a per-class calibrated commit gate, and a per-event conflict-drop window that refuses to commit foundation-model claims contradicted by the geometric channel at the moment of the claim. We evaluate on KITTI-360 and ScanNet, with an oracle geometric channel (panoptic ground truth) and an off-the-shelf online semantic segmenter (Mask2Former) to demonstrate real-world performance. The operator produces substantially more accurate committed maps (KITTI is car commit precision 99.7% vs. 43.9% for the calibration-only operator; mean per-class IoU 0.522 vs. 0.180), retains more compositional true positives at higher precision than a monolithic compositional VLM prompt. The framework operates at deployment quality across both oracle and off-the-shelf-segmenter geometric channels, and is invariant under foundation-model substitution.
DRL-Based Pose Control for Double-Ackermann Robots Under Actuation Uncertainties ICRA 2026
Robust deployment of deep reinforcement learning (DRL) policies on real robots remains challenging due to discrepancies between simulation and real-world dynamics. We address this issue in the context of maneuvering with double-Ackermann-steering mobile robots, which introduce additional constraints due to their non-holonomic nature. Building upon the DRL framework ManeuverNet, we extend its objective from position control to full pose control, resulting in a more challenging task. We further investigate the impact of actuation-related uncertainties on policy transfer. The use of simplified actuation models during training of the extended policy can lead to poor generalization, shown by a success rate drop from 100% in PyBullet to 25% in Gazebo under stricter evaluation conditions. To address this limitation, we adopt a sim-to-sim-to-real approach, where actuation effects observed in Gazebo are incorporated into the PyBullet training environment. Using multi-environment DRL with SAC and CrossQ, we learn policies that remain robust despite modeling inaccuracies. This approach can significantly reduce the performance gap across simulators, achieving up to 92% success rate in Gazebo and maintaining 69% under stricter thresholds, with successful transfer to a real robot without additional tuning.
comment: 6 pages, 4 figures, 2 tables, Accepted for Uncertainty in Open-World Robotics an IEEE International Conference on Robotics & Automation (ICRA 2026) workshop
ScaRF-SLAM: Scale-Consistent Reconstruction with Feed-Forward Models and Classical Visual SLAM
Recent works have explored unifying SLAM with geometric foundation models (GFMs). However, directly using GFM predictions for tracking is highly sensitive to model capability and uncertainty, as geometric inaccuracies in the predictions can adversely affect pose estimation. To address this limitation, we present a decoupled framework that integrates classical feature-based SLAM with GFMs, which achieves higher quality and more consistent dense reconstruction. In brief, we use classical visual SLAM for robust low-latency tracking and use GFMs exclusively for mapping. By anchoring mapping to poses produced by the SLAM module and optimizing across depth scales, the proposed design avoids propagating inaccuracies from GFM predictions into pose estimation while imposing geometric constraints on the reconstruction. The system builds submaps from multiple posed keyframes and enforces scale consistency via lightweight frame and submap scale optimization. It also performs projection-based point cloud fusion within each submap, and updates submaps online to reflect trajectory updates from the feature-based SLAM. To evaluate tracking and reconstruction of our method, we introduce a loop-rich, building-scale indoor dataset with accurate sensor trajectories and LiDAR ground-truth. Experiments show that our approach achieves superior trajectory accuracy while improving reconstruction precision by 10%-20% over existing methods, with about 2 cm reconstruction error per 10 m chunk on building-scale dataset. On large-scale outdoor datasets, it attains 10 cm error per 30 m chunk (w.r.t LiDAR ground-truth models).
comment: 8 pages
Predicted-Flow Control Barrier Functions for Real-Time Safe Optimal Control
Control barrier functions (CBFs) provide real-time safety guarantees through pointwise conditions on the state. However, synthesizing a valid CBF is difficult and the resulting controllers are myopic. To address myopia, this article introduces predicted-flow control barrier functions (P-CBFs), which generalize the CBF from a function of the current state to a functional of a predicted flow under a parametrized control plan over a finite prediction horizon. For safety, a P-CBF can certify that the predicted flow is in a safe set over the entire prediction horizon. However, candidate P-CBFs suffer from the same challenge as candidate CBFs, namely, control constraints make it difficult to guarantee that the P-CBF is valid. This article resolves this challenge by introducing a terminal candidate P-CBF requiring that the predicted flow end in a backup safe set at the terminal time, and a planning-time shift that modulates the prediction horizon, providing an additional degree of freedom to ensure feasibility. The real-time control and the evolution of the control-plan parameter and planning-time shift are determined jointly by a single convex optimization that is guaranteed to be feasible and renders the associated safe set forward invariant. The resulting safe optimal flow control provides a safety certificate over the entire prediction horizon and unifies finite-horizon integral-cost optimization with safety certification. This optimization reduces to a quadratic program (QP) if the control constraints are a convex polytope. The QP implementation, termed FlowBarrier, is validated on a nonholonomic ground robot navigating a dense environment. FlowBarrier is compared to nonlinear model predictive control and two CBF-based safety filter methods across 100 trials, where FlowBarrier achieves the highest goal-reaching rate, zero safety violations, and the lowest computation time.
StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement
Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.
comment: Project page: https://junwon.me/StressDream/
Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation ICRA 2026
Fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the real robot. We argue this is a predictable consequence of collapsing heterogeneous joint groups (arm, gripper, head, wheeled base) into a single metric, where easy-to-predict joints can mask joints that still fail. We fine-tune SmolVLA (450M, action-expert only) on the 11-DoF Toyota HSR and compare it against $π_{0.5}$ (3.3B), a stronger pretrained baseline. Per-group analysis exposes two patterns: in SmolVLA, the mobile base converges slowest and limits overall performance. In expert-only fine-tuning of $π_{0.5}$ (training only the action head, backbone frozen), total MSE drops below the baseline but arm accuracy degrades. On 60 real-robot trials (20 per model), $π_{0.5}$ 80k (4.0/4) significantly outperforms both fine-tuned variants (expert-only 3k: 3.75/4; HSR-SmolVLA: 3.5/4; Mann-Whitney $p \leq 0.010$), despite expert-only 3k having the lowest total MSE. This separation is most consistent with the offline arm-group error, not total MSE or base-group error. We conclude that per-group error is a more reliable signal than total MSE for checkpoint selection on robots with heterogeneous action spaces. Code: https://github.com/paumontagut/per-group-mse-vla
comment: 4 pages, 3 figures, 3 tables. Accepted as poster at ICRA 2026 Workshop "From Data to Decisions: VLA Pipelines for Real Robots". Code: [https://github.com/paumontagut/per-group-mse-vla](https://github.com/paumontagut/per-group-mse-vla)
HOIST: Humanoid Optimization with Imitation and Sample-efficient Tuning for Manipulating Suspended Loads
Manipulating suspended payloads with humanoid robots is challenging because the robot can only influence an underactuated, oscillatory load through whole-body motion and intermittent contact. Imitation learning provides safe initial behavior but does not directly optimize final placement, while reinforcement learning from scratch is unsafe and sample-inefficient on real humanoids. We present HOIST-Humanoid Optimized with Imitation and Sample-efficient Tuning for manipulating suspended loads. HOIST first finetunes a high-level vision-language-action (VLA) policy from virtual-reality (VR) teleoperation demonstrations and executes its commands through a whole-body controller. It then uses VLA rollouts and iterative batched RL to improve placement accuracy and stopping behavior. Experiments in simulation and on a real humanoid show that HOIST improves over imitation-only and additional-demonstration baselines; compared with pure VLA rollouts, HOIST reduces translational placement error by 19.9 cm and raw angular error by 3.56 degrees, demonstrating the potential of humanoids for underactuated material-handling tasks.
Continuous Reasoning for Vision-Language-Action
Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over π0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action.
comment: Project page: https://continuous-reasoning.airoa.io
Series-Parallel Integrated Nonlinear Elastic Actuator applied to the lean motion of a bicycle simulator
Designing robots for high-torque, high-fidelity haptic interaction is challenging. Parallel Elastic Actuators (PEAs) use elastic elements in parallel to smaller motors to complement torques, and Series Elastic Actuators (SEAs) use elastic elements in series to decouple motor impedance and improve force control. Recent work combines SEAs and PEAs to obtain both benefits but requires separate elastic elements or clutching. This paper presents the Series Parallel Integrated Nonlinear Elastic Actuator (SPINEA), which merges SEA and PEA such that a single elastic element takes on dual roles simultaneously, parallel and series. This is achieved by a nonlinear transmission in which the motor and load have misaligned rotation axes and are elastically connected. This geometry enables both high peak torque and precise torque tracking. We apply SPINEA to actuate lean of a haptic bicycle simulator, which requires high moments and precise rendering for safe and realistic rider interactions. We realized a prototype and performed experiments, both with an external excitation setup and with riders cycling. Our results confirm SPINEA's low impedance and precise torque tracking, up to 4.25 Hz with the bicycle frame fixed and up to 4 Hz with riders. The benefits may transfer to other applications requiring compact, high-performance actuation.
Cuttlebot: a platform demonstration for complex, autonomous, bio-inspired swimmers
Increasing interest in deep-sea operations and resources motivates the development of ecologically sensitive but environmentally durable robots. Dielectric elastomer actuator artificial muscles are good candidates for powering such systems due to their pressure and temperature tolerance and soft makeup, but they are difficult to integrate with robotic systems. This work presents an autonomous robotic platform: the CORE, capable of driving six artificial muscles while sensing visual and spatial information. To validate the platform, we developed the Cuttlebot - a cuttlefish-inspired robot that swims in three dimensions using undulatory fin locomotion. The Cuttlebot has four primary artificial muscles in its fins in addition to a tentacle-inspired soft gripper. The robot was evaluated in a series of tethered and untethered swimming tests, demonstrating a top speed of 2.5 centimeters per second translation and 10 degrees per second rotation. Furthermore, the CORE system was capable of driving specialized control signals into the artificial muscles to controllably output force and torque in six axes. This work provides a platform for developing complex, bio-inspired swimming robots for ocean exploration and monitoring, laying the foundation with our leading example: the Cuttlebot.
TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments ICML
Robots in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, despite semantic inference being inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency. Project website: https://ucla-mobility.github.io/TIC-VLA/
comment: International Conference on Machine Learning (ICML) 2026
Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks ICRA 2026
Many dexterous manipulation tasks are non-markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although they are successful in bringing internet-scale semantic understanding to robotics, existing VLAs are primarily "stateless" and struggle with memory-dependent long horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.
comment: To appear at ICRA 2026
Mixture of Horizons in Action Chunking ICML 2026
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying fixed choice of single horizons being suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $π_0$, $π_{0.5}$, and one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under mixed-task setting, $π_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://timsty1.github.io/moh/
comment: Accepted at ICML 2026
World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning which primarily focuses on optimal actions, a world model needs to be reliable over a vast space of suboptimal actions, which are often underrepresented in action-labeled robot interactions. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two independently verifiable factors: state plausibility and action reachability. We show that verifying these factors is significantly more tractable than direct forward prediction due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among proposed subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods often fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by over 22%.
comment: Project Website: https://world-action-verifier.github.io
Mollified Value Learning
Offline goal-conditioned reinforcement learning (GCRL) learns goal-reaching behaviors from static datasets, but accurate value estimation remains challenging under limited state-action coverage. Existing physics-informed approaches address this by imposing pointwise distance-like geometric constraints derived from Hamilton--Jacobi--Bellman (HJB) optimality principles, often through first-order partial differential equations such as the Eikonal equation. However, enforcing local consistency through explicit differential structure can become unstable in complex, high-dimensional environments. Our key insight is to instead reinterpret distance-like constraints as an expectation over a local spatial measure. By aggregating constraints over this measure rather than evaluating them pointwise, the objective acts as a spatial mollifier, inducing distance-like value geometry without requiring expensive differential operators. We refer to this as Mollified Value Learning (MVL). Experiments across navigation and high-dimensional robotic manipulation tasks show that MVL learns structured, value representations, improving goal-reaching performance, when used with implicit value representation learning methods. Open-source codes are available at https://github.com/HrishikeshVish/MVL.
Dreaming Across Towns: Semantic Rollout and Town-Adversarial Regularization for Zero-Shot Held-Out-Town Fixed-Route Driving in CARLA
Driving agents trained in one simulated town often perform poorly in a new town because the road shapes, intersections, and lane layouts can be different. This paper studies how to improve this kind of transfer in the CARLA driving simulator without giving the agent any training data from the test towns. The agent is trained only in Town05 and Town06, then evaluated directly in Town03 and Town04. To focus on road-layout differences, all experiments use the same weather and traffic settings. We propose a training method that encourages the agent to learn features that are useful across towns rather than features tied to one training town. During training, the agent is asked to predict the high-level visual meaning of future camera views and is also discouraged from relying on cues that reveal which source town the data came from. These extra learning signals are used only during training; at test time, the driving policy uses the same observation and control interface as the baseline agent. In controlled comparisons with matched DreamerV3-style world-model driving agents, the proposed method achieves the highest mean held-out success: 36.6\% on Town03 with a 95\% confidence interval of [30.5, 42.7] and 85.6\% on Town04 with a 95\% confidence interval of [84.0, 87.2], computed across five training seeds. Seed-paired tests against the strongest primary baselines show positive success-rate differences in both held-out towns. Additional experiments show that predicting future visual meaning alone or removing town-specific cues alone is not enough to match the combined method. These results suggest that combining future-scene understanding with reduced reliance on source-town-specific features can improve cross-town driving performance in this CARLA setting.
LangMap: A Human-Verified Benchmark for Hierarchical Open-Vocabulary Goal Navigation
Language-conditioned goal navigation (LGN) requires agents to locate user-specified targets without step-by-step guidance. However, existing benchmarks largely focus on category-level goals or rely on instance descriptions generated by vision-language models (VLMs), which often contain ambiguities and semantic errors, limiting systematic and reliable evaluation. We introduce HieraNav, an open-vocabulary LGN task with goals specified at four hierarchical semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), to our knowledge the first real-world 3D indoor navigation benchmark with human-verified semantic annotations to support tasks across all four goal levels. LangMap provides region labels and discriminative region and instance descriptions covering 414 object categories, produced through a rigorous contrastive annotation protocol comparing same-scene regions and instances, and contains over 18K tasks. Each target is paired with concise and detailed descriptions, enabling evaluation across instruction styles. Quantitative and qualitative analyses validate our annotation quality; notably, our instance descriptions outperform GOAT-Bench annotations by 23 percentage points in text-to-view matching. We further introduce PlaNaVid, a strong RGB-only baseline that combines Bounded Diverse Memory (BDM) with high-level planning to prime a reactive policy for multi-goal navigation. PlaNaVid achieves top-tier success rates without depth, 3D scene representations, or object masks. Further analysis shows that memory and richer context boost performance, while long-tailed categories, small objects, distant targets, and multi-goal completion remain open challenges. The benchmark is available at https://bo-miao.github.io/LangMap
Motion Tracking with Muscles: Predictive Control of a Parametric Musculoskeletal Canine Model
We introduce a novel musculoskeletal model of a dog, procedurally generated from accurate 3D muscle meshes. Accompanying this model is a motion capture-based locomotion task compatible with a variety of control algorithms, as well as an improved muscle dynamics model designed to enhance convergence in differentiable control frameworks. We validate our approach by comparing simulated muscle activation patterns with experimentally obtained electromyography (EMG) data from previous canine locomotion studies. This work aims to bridge gaps between biomechanics, robotics, and computational neuroscience, offering a robust platform for researchers investigating muscle actuation and neuromuscular control.We plan to release the full model along with the retargeted motion capture clips to facilitate further research and development.
LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation
Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose LiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.
Cross-Entropy Optimization of Physically Grounded Task and Motion Plans
Autonomously performing tasks often requires robots to plan high-level discrete actions and continuous low-level motions to realize them. Previous TAMP algorithms have focused mainly on computational performance, completeness, or optimality by making the problem tractable through simplifications and abstractions. However, this comes at the cost of the resulting plans potentially failing to account for the dynamics or complex contacts necessary to reliably perform the task when object manipulation is required. Additionally, approaches that ignore effects of the low-level controllers may not obtain optimal or feasible plan realizations for the real system. We investigate the use of a GPU-parallelized physics simulator to compute realizations of plans with motion controllers, explicitly accounting for dynamics, and considering contacts with the environment. Using cross-entropy optimization, we sample the parameters of the controllers, or actions, to obtain low-cost solutions. Since our approach uses the same controllers as the real system, the robot can directly execute the computed plans. We demonstrate our approach for a set of tasks where the robot is able to exploit the environment's geometry to move an object. Website and code: https://andreumatoses.github.io/research/parallel-realization
comment: Accepted for publication in IEEE Robotics and Automation Letters (RA-L)
Variance-Reduced Model Predictive Path Integral via Quadratic Model Approximation
Sampling-based controllers, such as Model Predictive Path Integral (MPPI) methods, offer substantial flexibility but often suffer from high variance and low sample efficiency. To address these challenges, we introduce a hybrid variance-reduced MPPI framework that integrates a prior model into the sampling process. Our key insight is to decompose the objective function into a known approximate model and a residual term. Since the residual captures only the discrepancy between the model and the objective, it typically exhibits a smaller magnitude and lower variance than the original objective. Although this principle applies to general modeling choices, we demonstrate that adopting a quadratic approximation enables the derivation of a closed-form, model-guided prior that effectively concentrates samples in informative regions. Crucially, the framework is agnostic to the source of geometric information, allowing the quadratic model to be constructed from exact derivatives, structural approximations (e.g., Gauss- or Quasi-Newton), or gradient-free randomized smoothing. We validate the approach on standard optimization benchmarks, a nonlinear, underactuated cart-pole control task, and a contact-rich manipulation problem with non-smooth dynamics. Across these domains, we achieve faster convergence and superior performance in low-sample regimes compared to standard MPPI. These results suggest that the method can make sample-based control strategies more practical in scenarios where obtaining samples is expensive or limited.
comment: Accepted to Robotics: Science and Systems (RSS) 2026, Sydney, Australia
Collaborative Navigation and Exploration with $β$-Sparse Gaussian Processes
Collaborative navigation of heterogeneous robots in unknown environments poses significant challenges due to sensing, communication, and computational limitations. In this work, a lead robot navigates toward a target while a mobile sensor robot (e.g., a drone) assists by transmitting information about its locally observed map under bandwidth constraints. We propose a framework that enables the sensor to jointly select its transmitted map points and navigation actions online, while also predicting unexplored regions of the environment. To this end, we present $β$-Sparse Gaussian Processes, a robust variational sparse Gaussian Process model for task-aware inducing point selection under cardinality constraints. Furthermore, we develop an action-selection strategy that balances task relevance with exploration. Simulations on Mars and Earth maps show that the framework can reduce path cost by 18% relative to no communication and decrease transmitted information by 76% compared to raw-data transmission baselines.
comment: 16 pages, 6 figures
UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms
Simulation-based RL for contemporary robot control is increasingly organized around GPU-resident simulation: physics, rollout collection, and learning are placed on a single GPU-centric execution path. This paradigm has greatly improved training speed, but it has also encouraged a default assumption that efficient training requires physics to reside on the GPU. We revisit this assumption. Our view is that, in simulation-dominated robot control, the essential question is not which processor runs physics, but whether simulation throughput, policy learning, and runtime synchronization form an efficient end-to-end loop. We present UniLab, a heterogeneous CPU-simulation / GPU-learning architecture that decouples CPU-parallel simulation from GPU policy updates through a unified runtime for data movement, buffering, and synchronization. UniLab is implemented as a complete and extensible training system using MuJoCoUni and MotrixSim CPU-batched physics backends, supporting PPO, FastSAC, FlashSAC, and APPO. On representative simulation-based robot control tasks, UniLab improves end-to-end training efficiency by 3--10$\times$ under the same hardware configuration, while reducing dependence on the NVIDIA CUDA-based software stack and supporting cross-platform execution on the Apple macOS platform and the AMD ROCm and Intel XPU accelerator backends. These results show that GPU simulation is an effective path to efficient training, but not a necessary one, broadening the practical system choices available for robot RL training. Project page: https://unilabsim.github.io.
Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt ICRA
Recent robot learning methods commonly rely on imitation learning from massive robotic dataset collected with teleoperation. When facing a new task, such methods generally require collecting a set of new teleoperation data and finetuning the policy. Furthermore, the teleoperation data collection pipeline is also tedious and expensive. Instead, human is able to efficiently learn new tasks by just watching others do. In this paper, we introduce a novel two-stage framework that utilizes human demonstrations to learn a generalizable robot policy. Such policy can directly take human demonstration video as a prompt and perform new tasks without any new teleoperation data and model finetuning at all. In the first stage, we train video generation model that captures a joint representation for both the human and robot demonstration video data using cross-prediction. In the second stage, we fuse the learned representation with a shared action space between human and robot using a novel prototypical contrastive loss. Empirical evaluations on real-world dexterous manipulation tasks show the effectiveness and generalization capabilities of our proposed method.
comment: Accepted to the IEEE International Conference on Robotics and Automation (ICRA), 2026
EBuddy: a workflow orchestrator for industrial human-machine collaboration
This paper presents EBuddy, a voice-guided workflow orchestrator for natural human-machine collaboration in industrial environments. EBuddy targets a recurrent bottleneck in tool-intensive workflows: expert know-how is effective but difficult to scale, and execution quality degrades when procedures are reconstructed ad hoc across operators and sessions. EBuddy operationalizes expert practice as a finite state machine (FSM) driven application that provides an interpretable decision frame at runtime (current state and admissible actions), so that spoken requests are interpreted within state-grounded constraints, while the system executes and monitors the corresponding tool interactions. Through modular workflow artifacts, EBuddy coordinates heterogeneous resources, including GUI-driven software and a collaborative robot, leveraging fully voice-based interaction through automatic speech recognition and intent understanding. An industrial pilot on impeller blade inspection and repair preparation for directed energy deposition (DED), realized by human-robot collaboration, shows substantial reductions in end-to-end process duration across onboarding, 3D scanning and processing, and repair program generation, while preserving repeatability and low operator burden.
Symmetries Here and There, Combined Everywhere: Cross-space Symmetry Compositions in Robotics
Robots exhibit a rich variety of symmetries arising from their mechanical structure and the properties of their tasks. Although many robotics problems exhibit several symmetries simultaneously, existing approaches typically treat them in isolation, failing to exploit their combined potential. This paper introduces cross-space symmetry compositions, a framework for learning robot policies that are jointly equivariant to multiple symmetries across configuration and task spaces. Leveraging the differential-geometric structure of the forward kinematics map, we both descend symmetries from configuration to task space and lift symmetries from task to configuration space, enabling their composition within a unified representation space. We validate our framework on simulated and real-world experiments on a dual-arm robot, demonstrating that jointly leveraging multiple symmetries yields improved generalization.
comment: 8 pages, 8 figures, 1 table
Safety-Critical Adaptive Impedance Control via Nonsmooth Control Barrier Functions under State and Input Constraints
Safe physical interaction is critical for deploying robotic manipulators in human-robot interaction and contact-rich tasks, where uncertainty, external forces, and actuator limitations can compromise both performance and safety. We propose an online adaptive impedance control framework that enforces joint-state safety while achieving compliant interaction under uncertain dynamics. The approach combines a quadratic-program-based safety filter with a novel composed position-velocity non-smooth control barrier function (NCBF), enabling joint position and velocity constraints to be enforced through a unified relative-degree-one barrier. Unknown dynamics are compensated online using an interval type-2 fuzzy logic system, while actuator torque limits are handled through soft constraints with exact penalty recovery of feasible solutions. A disturbance-observer-enhanced safety mechanism improves robustness against modelling errors and external interaction forces. Using composite Lyapunov analysis, we prove forward invariance of the safe set and the uniform ultimately boundedness of the impedance-tracking error. Simulations on a 7-DOF manipulator with severe parametric uncertainty and external interaction wrenches demonstrate safe constraint satisfaction and robust impedance tracking.
comment: 12 pages, 3 figures
Self-Supervised Online Robot-Agnostic Traversability Estimation for Open-World Environments
Self-supervised online traversability estimation enables robots to continuously learn from unlabeled open-world experiences and adapt their navigation behavior toward safe and efficient trajectories. Existing approaches either rely on handcrafted proprioceptive traversability scores, limiting robot-agnosticism, or cluster prior data, preventing online learning. Moreover, many continual learning methods incur substantial memory and computational costs, hindering onboard deployment. We introduce COTRATE, an online learning framework for continuous traversability estimation from multimodal, unlabeled robot experience. Our method first infers robust traversability scores using a robot-agnostic, learning-based online terrain assessment module operating on proprioceptiveand inertial signals. These scores then supervise a visual traversability network through a novel alignment loss that associates visual embeddings with online terrain assessments. To mitigate forgetting during continual learning with minimal overhead, we propose a diversity-aware feature selection strategythat preserves performance using a compact replay memory. We further show that the learned traversability representation supports knowledge transfer across different robot platforms with different locomotion kinematics. We evaluate COTRATE on a dataset of $\approx$ 50,000 images collected with two robotic platforms across 11 outdoor terrains, and benchmark it on navigation tasks in three representative outdoor environments. We make the dataset, code, and trained models publicly available.
comment: 14 pages, 16 Figures
SignScene: Visual Sign Grounding for Mapless Navigation
Navigational signs enable humans to navigate unfamiliar environments without maps. This work studies how robots can similarly exploit signs for mapless navigation in the open world. A central challenge lies in interpreting signs: real-world signs are diverse and complex, and their abstract semantic contents need to be grounded in the local 3D scene. We formalize this as sign grounding, the problem of mapping semantic instructions on signs to corresponding scene elements and navigational actions. Recent Vision-Language Models (VLMs) offer the semantic common-sense and reasoning capabilities required for this task, but are sensitive to how spatial information is represented. We propose SignScene, a sign-centric spatial-semantic representation that captures navigation-relevant scene elements and sign information, and presents them to VLMs in a form conducive to effective reasoning. We evaluate our grounding approach on a dataset of 114 queries collected across nine diverse environment types, achieving 88% grounding accuracy and significantly outperforming baselines. Finally, we demonstrate that it enables real-world mapless navigation on a Spot robot using only signs.
comment: Under review for a conference
SKETCH: Semantic Key-Point Conditioning for Long-Horizon Vessel Trajectory Prediction
Accurate long-horizon vessel trajectory prediction remains challenging due to compounded uncertainty from complex navigation behaviors and environmental factors. Existing methods often struggle to maintain global directional consistency, leading to drifting or implausible trajectories when extrapolated over long time horizons. To address this issue, we propose a semantic-key-point-conditioned trajectory modeling framework, in which future trajectories are predicted by conditioning on a high-level Next Key Point (NKP) that captures navigational intent. This formulation decomposes long-horizon prediction into global semantic decision-making and local motion modeling, effectively restricting the support of future trajectories to semantically feasible subsets. To efficiently estimate the NKP prior from historical observations, we adopt a pretrain-finetune strategy. Extensive experiments on real-world AIS data demonstrate that the proposed method consistently outperforms state-of-the-art approaches, particularly for long travel durations, directional accuracy, and fine-grained trajectory prediction.
AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement
Vision-Language-Action (VLA) policies have emerged as a versatile paradigm for generalist robotic manipulation. However, precise object placement under compositional language remains challenging for end-to-end VLA policies. Slot-level placement requires reliable slot grounding and centimeter-level geometric precision. To this end, we propose AnySlot, a framework that reduces compositional complexity by introducing an explicit spatial visual goal between language grounding and control. AnySlot converts language into a visual goal by rendering a spatial marker at the intended slot, then executes this goal with a goal-conditioned VLA policy. This hierarchical design decouples high-level slot selection from low-level execution, improving semantic accuracy and spatial robustness. Furthermore, recognizing the lack of benchmarks for such precision-demanding tasks, we introduce SlotBench, a structured simulation benchmark with nine task categories for evaluating spatial reasoning in slot-level placement. Extensive experiments show that AnySlot significantly outperforms flat VLA baselines and modular grounding methods in zero-shot slot-level placement.
A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics
We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.
LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries ICML 2026
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $π(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments across on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
comment: ICML 2026
Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
Diffusion-based visuomotor policies perform well in robotic manipulation, yet current methods still inherit image-generation-style decoders and multi-step sampling. We revisit this design from a frequency-domain perspective. Robot action trajectories are highly smooth, with most energy concentrated in a few low-frequency discrete cosine transform modes. Under this structure, we show that the error of the optimal denoiser is bounded by the low-frequency subspace dimension and residual high-frequency energy, implying that denoising error saturates after very few reverse steps. This also suggests that action denoising requires a much simpler denoising model than image generation. Motivated by this insight, we propose Hyper-DP3 (HDP3), a pocket-scale 3D diffusion policy with a lightweight Diffusion Mixer decoder that supports two-step DDIM inference. Our synthetic experiments validate the theory and support the sufficiency of two-step denoising. Futhermore, across RoboTwin2.0, Adroit, MetaWorld, and real-world tasks, HDP3 achieves state-of-the-art performance with fewer than 1% of the parameters of prior 3D diffusion-based policies and substantially lower inference latency.
Meta-Adaptive Beam Search Planning for Transformer-Based Reinforcement Learning Control of UAVs with Overhead Manipulators under Flight Disturbances
Drones equipped with overhead manipulators offer unique capabilities for inspection, maintenance, and contact-based interaction. However, the motion of the drone and its manipulator is tightly linked, and even small attitude changes caused by wind or control imperfections shift the end-effector away from its intended path. This coupling makes reliable tracking difficult and also limits the direct use of learning-based arm controllers that were originally designed for fixed-base robots. These effects appear consistently in our tests whenever the UAV body experiences drift or rapid attitude corrections. To address this behavior, we develop a reinforcement-learning (RL) framework with a transformer-based double deep Q learning (DDQN), with the core idea of using an adaptive beam-search planner that applies a short-horizon beam search over candidate control sequences using the learned critic as the forward estimator. This allows the controller to anticipate the end-effector's motion through simulated rollouts rather than executing those actions directly on the actual model, realizing a software-in-the-loop (SITL) approach. The lookahead relies on value estimates from a Transformer critic that processes short sequences of states, while a DDQN backbone provides the one-step targets needed to keep the learning process stable. Evaluated on a 3-DoF aerial manipulator under identical training conditions, the proposed meta-adaptive planner shows the strongest overall performance with a 10.2% reward increase, a substantial reduction in mean tracking error (from about 6% to 3%), and a 29.6% improvement in the combined reward-error metric relative to the DDQN baseline. Our method exhibits elevated stability in tracking target tip trajectory (by maintaining 5 cm tracking error) when the drone base exhibits drifts due to external disturbances, as opposed to the fixed-beam and Transformer-only variants.
comment: The paper will be reworked significantly
DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding
Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at https://icr-lab.github.io/DGSG-Mind
comment: 9 pages, 6 figures
Neurosim: A Fast Simulator for Neuromorphic Robot Perception
Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors. It can also simulate agile dynamics of multi-rotor vehicles in complex and dynamic environments. Neurosim can achieve frame rates as high as ~2700 FPS on a desktop GPU. Neurosim integrates with a ZeroMQ-based communication library called Cortex to facilitate seamless integration with machine learning and robotics workflows. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors. This paper discusses the design philosophy behind Neurosim and Cortex. It demonstrates how they can be used to (i) train neuromorphic perception and control algorithms, e.g., using self-supervised learning on time-synchronized multi-modal data, and (ii) test real-time implementations of these algorithms in closed-loop. Neurosim and Cortex are available at https://github.com/grasp-lyrl/neurosim .
comment: 11 pages, 6 figures
HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames
Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.
CloSE: A Geometric Shape-Agnostic Cloth State Representation ICRA 2026
Cloth manipulation is a difficult problem mainly because of the non-rigid nature of cloth, which makes a good representation of deformation essential. We present a new representation for the deformation-state of clothes. First, we propose the dGLI disk representation based on topological indices computed for edge segments of the cloth border that are arranged on a circular grid. The heat-map of the dGLI disk uncovers patterns that correspond to features of the cloth state that are consistent for different shapes, sizes or orientation of the cloth. We then abstract these important features from the dGLI disk into a circle, calling it the Cloth StatE representation (CloSE). This representation is compact, continuous, and general for different shapes. We show that this representation is able to accurately predict the fold locations for several simulation clothing datasets. Finally, we also show the strengths of this representation in two relevant applications: semantic labeling and high- and low-level planning. The code and the dataset can be accessed from: https://close-representation.github.io/
comment: Accepted at ICRA 2026 (8 pages, 11 figures, 1 table). Project page: https://close-representation.github.io/
Feedback Matters: Augmenting Autonomous Dissection with Visual and Topological Feedback
Autonomous surgical systems must adapt to highly dynamic environments where tissue properties and visual cues evolve rapidly. Central to such adaptability is feedback: the ability to sense, interpret, and respond to changes during execution. While feedback mechanisms have been explored in surgical robotics, ranging from tool and tissue tracking to error detection, existing methods remain limited in handling the topological and perceptual challenges of tissue dissection. In this work, we propose a feedback-enabled framework for autonomous tissue dissection that explicitly reasons about topological changes from endoscopic images after each dissection action. This structured feedback guides subsequent actions, enabling the system to localize dissection progress and adapt policies online. To improve the reliability of such feedback, we introduce visibility metrics that quantify tissue exposure and formulate optimal controller designs that actively manipulate tissue to maximize visibility. Finally, we integrate these feedback mechanisms with both planning-based and learning-based dissection methods, and demonstrate experimentally that they significantly enhance autonomy, reduce errors, and improve robustness in complex surgical scenarios.
HyperDet: 3D Object Detection with Hyper 4D Radar Point Clouds
How far can 3D object detection go using 4D radar alone? Despite offering weather-robust and velocity-aware sensing for autonomous perception, modern 4D radar still yields sparse, noisy, and unstable point clouds, limiting radar-only 3D detection. We present HyperDet, a detector-agnostic framework that constructs task-aware hyper 4D radar point clouds before detection. HyperDet first refines short-window surround-view radar observations through spatio-temporal accumulation, cross-sensor validation, and Doppler-guided motion compensation, improving return reliability and temporal coherence. It then performs foreground generative enhancement using LiDAR-guided pseudo-radar supervision available only during training, enriching object geometry while preserving measured radar background and radar-native attributes. During detector training, radar-aware object-level augmentation further preserves Doppler consistency under geometric relocation. At inference time, HyperDet requires radar input alone and can be directly paired with standard 3D detectors. Experiments on two public surround-view 4D radar datasets demonstrate consistent improvements over raw radar inputs across standard 3D detectors, validating input-level radar enhancement as an effective approach to radar-only 3D detection.
comment: 11 pages, 3 figures, 3 tables
Learning Transferable Motor Skills for Geometry-Aware Robotic Surface Tasks ICRA 2026
Robotic surface-interaction tasks, such as spray painting or welding, require both accurate geometric planning and precise motion execution. While modern motion planners generate valid geometric paths, they often lack the expert motor patterns observed in human operators. Conversely, learning from demonstration often tightly couples task execution to the specific training geometry, limiting transferability. We propose a modular framework that decouples geometric motion planning from execution-level expertise. Expert behavior is represented as a vocabulary of interpretable, atomic motor rules, such as velocity scaling and orientation offsets, that systematically modify a geometrically planned reference path. We train a multimodal neural network to infer rule parameters jointly from kinematic trajectory data and CAD model geometry. We evaluate our approach through dynamic simulation on L-shaped and window-shaped objects, demonstrating on simulated data that the model successfully extracts velocity and orientation rules across both topologies.
comment: In: Workshop on Geometry in the Age of Data-Driven Robotics, ICRA 2026, Vienna, 2026
CLAW: A Vision-Language-Action Framework for Weight-Aware Robotic Grasping
Vision-language-action (VLA) models have recently emerged as a promising paradigm for robotic control, enabling end-to-end policies that ground natural language instructions into visuomotor actions. However, current VLAs often struggle to satisfy precise task constraints, such as stopping based on numeric thresholds, since their observation-to-action mappings are implicitly shaped by training data and lack explicit mechanisms for condition monitoring. In this work, we propose CLAW (CLIP-Language-Action for Weight), a framework that decouples condition evaluation from action generation. CLAW leverages a fine-tuned CLIP model as a lightweight prompt generator, which continuously monitors the digital readout of a scale and produces discrete directives based on task-specific weight thresholds. These prompts are then consumed by $π_0$, a flow-based VLA policy, which integrates the prompts with multi-view camera observations to produce continuous robot actions. This design enables CLAW to combine symbolic weight reasoning with high-frequency visuomotor control. We validate CLAW on three experimental setups: single-object grasping and mixed-object tasks requiring dual-arm manipulation. Across all conditions, CLAW reliably executes weight-aware behaviors and outperforms both raw-$π_0$ and fine-tuned $π_0$ models. A video of our paper is available online https://youtu.be/MuMYj2QgReI.
comment: 8 pages, 5 figures, Video: https://youtu.be/MuMYj2QgReI
Multiagent Systems
Dreaming Of Others: Latent Teammate Modeling In World Models For Multi-Agent Reinforcement Learning
In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong generalization and sample efficiency in single-agent settings, their application to MARL remains limited by an inability to handle teammate-induced uncertainty. We propose a new perspective: treat teammates as structured, learnable components within the agent's world model. We introduce an architecture that factorizes the latent state of a Dreamer-style recurrent state-space model (RSSM) into environment and teammate components, and learns an auxiliary Theory-of-Mind (ToM) head to infer latent embeddings of partner behavior such as character, intent, and predicted actions from partial trajectories. These teammate latents condition the actor and critic, enabling the agent to imagine and adapt to diverse collaborators. We outline how this approach can support zero-shot and few-shot coordination in partially observable settings and propose a set of benchmarks and evaluation protocols to assess its impact. This work positions world models as not only predictors of environmental dynamics, but as simulators of social behavior, opening new directions for generalizable, human-compatible AI.
comment: 5 pages, 2 figures. Accepted as a poster at the 2026 World Modeling Workshop. Conceptual workshop paper
Social welfare optimisation under institutional reward and punishment
Institutional incentives are widely used to promote cooperation among autonomous, self-regarding agents, from human societies to multi-agent and AI systems. Existing work typically treats incentive design as a bi-objective problem: minimise institutional cost while achieving a high long-run frequency of cooperation. Whether such schemes also maximise social welfare - total population payoff net of institutional expenditure - has remained largely unexplored. We develop a welfare-centric framework for institutional incentives in finite, well-mixed populations playing a social dilemma (Donation Game and Public Goods Game), considering both rewards for cooperators and punishments for defectors. For each mechanism, we derive explicit expressions for expected social welfare and characterise how it depends on incentive efficiency and selection intensity. Analytically, we identify parameter regimes where social welfare has a single optimal incentive level and regimes with qualitative phase transitions, in which welfare becomes non-monotonic with multiple local optima. We prove that any welfare-maximising incentive is either zero or concentrated around a simple closed-form target, and we provide an efficient algorithm to compute these optima. Comparing reward and punishment, we further derive close-formed conditions under which reward outperform punishment in terms of social welfare for any given budget. Overall, our results reveal a systematic gap between incentives optimised for cost or cooperation frequency and those that maximise welfare.
Generalized Intention Modeling in Multi-Agent Reinforcement Learning
Modeling an opponent's intent is critical for effective decision-making in non-cooperative, competitive, and general-sum multi-agent reinforcement learning. Existing opponent modeling methods encode intent using an embedding derived from episode information chosen a priori, such as the opponent's next action or a future environment state, and use this to guide the ego-agent's behavior. These approaches assume that the chosen information is universally representative of intent; however, we show empirically that this is not the case as intentions are often task- and environment-dependent. To address this, we introduce a task-adaptive opponent modeling framework that learns a performance-driven mixture of multiple intent representations. We further introduce a new intention representation that maximizes mutual information with the ego-agent's future returns, thereby capturing opponent information that is most directly relevant to performance. Our approach consistently matches or exceeds the performance of state-of-the-art baselines across diverse tasks and yields insights into when and why different opponent modeling strategies succeed.
Comparing Market Mechanism Efficiencies
We develop a game-theoretic framework that compares welfare efficiency across three market mechanisms: continuous double auctions with transparent order books (lit exchanges), opaque order books (dark pools), and periodic batch auctions. Each mechanism is modeled as a queuing system where heterogeneous traders face trade-offs between the execution price, waiting costs, and transaction costs. Our main result establishes that under moderate arrival rates and bounded adverse selection, dark pools dominate both alternatives in aggregate ex-ante welfare. Observable order books create costly strategic timing games in which traders delay or rush submissions to optimize their position in the queue, generating wasteful social waiting costs. Opaque order books eliminate these timing games through information design. We formally characterize the equilibrium strategies in each mechanism and prove the welfare ranking $W^{DARK} > W^{LIT} > W^{BATCH}$. Extensions incorporate asymmetric information and endogenous venue choice. The results demonstrate how the information structure and the discipline of the service jointly determine efficiency in strategic matching environments.
comment: 79 pages
HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster ECML-PKDD 2026
This work addresses the problem of autonomous resource management in heterogeneous satellite cluster conducting Earth Observation (EO) missions including optical and Synthetic Aperture Radar (SAR) satellites. In autonomous operation mode, satellites are equipped with intelligent capabilities enabling real-time decision-making based on the latest conditions, while requiring minimal interaction with ground operators. Traditional scheduling approaches typically rely on mathematical models to represent satellite mission and resource management. Then, this problem is solved by using optimization algorithms. However, such solutions become less effective when the underlying models are not available, over complex, and inaccurate due to dynamic changes and uncertainties inherent in the space mission environment. A promising alternative is to reformulate the problem as a sequential decision-making process and apply model-free reinforcement learning techniques to enable adaptive and real-time resource management. To this end, we propose a novel transformer-based architecture tailored for heterogeneous satellite cluster autonomous EO Mission with relational observations-actions tokenization and differential attention mechanism. Our experimental results demonstrate significant performance improvements compared to the available baselines. Moreover, the proposed architecture exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters.
comment: Accepted in ECML-PKDD 2026. arXiv admin note: text overlap with arXiv:2511.12792
Safe Equilibrium Policy Optimization for Strategic Agent Policies EMNLP 2026
Language models fine-tuned with reinforcement learning typically optimize for task reward, ignoring multi-agent strategic structure. Because these agents condition on natural language game-state descriptions and emit actions through free-form generation, strategic failure modes -- exploiting weaker opponents, coordinating on harmful equilibria, and externalizing costs are inseparable from the language interface itself. We propose Safe Equilibrium Policy Optimization (\sepo{}), a training objective that augments expected payoff with explicit penalties for exploitability, collusion risk, and externality cost. We implement \sepo{} as a reward signal for Group Relative Policy Optimization (GRPO), applied to Gemma~4 E4B-it and Qwen~3.5-4B after supervised fine-tuning (SFT). Evaluated across five strategic domains: Iterated Prisoner's Dilemma, repeated auctions, two negotiation variants, and Kuhn Poker. \sepo{} achieves zero exploit-pool advantage in Kuhn Poker for both models, outperforms the base model on safety in four domains, and corrects the over-cooperative behavior introduced by SFT. In negotiation, \sepo{} achieves a positive-safety outcome and only the positive normalized relative advantage of any negotiation configuration. Ablation experiments confirm that per-rollout exploit computation is necessary: a shared constant penalty cancels in GRPO advantage normalization (constant control-variate property), producing zero gradient. To support further research in strategic safety for agents, we release our \href{https://anonymous.4open.science/r/sepo-2668/README.md}{code} and SFT datasets.
comment: Submitted to EMNLP 2026
Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution
Prediction markets aggregate collective intelligence to forecast uncertain events, but their utility depends on reliable outcome resolution. Existing oracle systems tradeoff fast but brittle automation against accurate but costly human arbitration. Single-LLM oracles achieve meaningful accuracy but inherit all failure modes of their underlying model with no self-correction mechanism. We evaluate whether multi-agent LLM architectures can improve oracle resolution accuracy over single-model baselines. We compare independent aggregation and deliberative consensus against single-LLM baselines (GPT-5 Nano, DeepSeek V3, and Llama-3.3-70B) on 1,189 resolved prediction market questions from KalshiBench. All agents share a common evidence layer through Exa, with retrieval filtered by publication date to isolate reasoning from retrieval quality. Independent aggregation with confidence-weighted voting achieves the highest accuracy at 83.43 percent, outperforming the best individual model by 1.01 percentage points. Deliberative consensus degrades accuracy to approximately 76 percent, below every single-model baseline, attributed to error propagation during debate where confidently wrong models flip correct ones. Error correlations across models (0.529-0.689) explain why aggregation gains fall short of the theoretical Condorcet ceiling, placing a fundamental limit on ensemble approaches. Many questions resist correction by any multi-agent architecture, motivating escalation to human arbitration. We propose routing criteria for hybrid AI-human oracle systems: auto-resolving only unanimous, high-confidence questions yields 97.87 percent accuracy on 47 percent of the dataset, with inter-agent disagreement flagging the remainder for human review.
comment: 34 pages, 11 figures
Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence
Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; \textit{aligned visual evidence} -- shared support from the image regions agents rely on -- is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (\textbf{E}vidence-\textbf{A}ligned \textbf{G}rounded mu\textbf{L}ti-agent r\textbf{E}asoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves best average performance across domains while remaining lightweight, interpretable, and practical for deployment.
Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response
Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes -- and a single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline's funds.
comment: 32 pages, 18 figures, 4 tables
Leveraging the Learning Curve: Reusing Existing Architectural Patterns to Design and Implement MAS
Recent advancements in AI have led to the development of specialized systems related to multi-agent systems (MAS). However, the inherently collaborative nature of agents is often overlooked, and many of these specialized systems are used as components by other AI systems. From a software engineering perspective, this context can benefit from aligning the architectural characteristics of distributed systems with the inherently distributed nature of MAS. We propose that introducing a minimal set of agent-related concepts into the Distributed Systems (DS) domain can improve the engineering of modern MAS by leveraging techniques from DS engineering with established agent theory. In this study, we recapitulated the common origins of MAS and DS by drawing architectural parallels to establish a unified engineering approach. We then defined a minimal set of agent concepts to perform two practical studies on leveraging MAS development. First, we incorporated these concepts into a DS architectural pattern to design a distributed MAS. We then used these concepts in a graduate course to teach MAS engineering to students with no prior knowledge of agent theory. The learning outcomes from both courses included successful MAS implementation using DS tools and techniques. Although more than two-thirds of these students had no practical experience in developing distributed systems, the average final grade in both courses was above 80\%, thus validating our approach. Finally, we discuss how this study supports the development of advanced systems using modern AI techniques consistently with established agent-related research while leveraging established DS techniques and concepts.
comment: Author's accepted manuscript of an article published in IEEE Access. 17 pages, 6 figures. IEEE Access, vol. 13, pp. 45809-45825, 2025. Copyright 2025 IEEE. Personal use of this material is permitted. The final version is available at https://doi.org/10.1109/ACCESS.2025.3546526
MindZero: Learning Online Mental Reasoning With Zero Annotations ICML 2026
Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges by introducing MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions estimated by a planner, similar to model-based ToM reasoning. This method thus eliminates the need for explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines across challenging mental reasoning and AI assistance tasks in gridworld and household domains. We found that LLMs alone are insufficient; model-based methods improve accuracy but are slow, costly, and limited by backbone MLLM capacity. In contrast, MindZero enhances MLLMs' intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self-supervised skill.
comment: ICML 2026. Website: https://scai.cs.jhu.edu/MindZero
Civilizational Metamaterials: Engineering Coordination Under Capability Gradients and Structural Turbulence
We argue that governance must transition from a normative discipline to an engineering discipline, and we develop a formal framework, inspired by the physics of metamaterials, to make this transition quantitative and testable. Artificial General Intelligence affects civilization primarily by increasing decision velocity while human verification capacity remains bounded. When the cost of validating AI-generated outputs exceeds the expected utility of acting on them, rational agents default to inaction: a stable but catastrophic Nash equilibrium we term the Freezing Equilibrium. Drawing on metamaterials, where emergent macro-properties arise from designed microstructure, we develop a phenomenological constitutive law for institutional coordination: $R_{\mathrm{eff}} = β\cdot (1-ρ) \cdot (1-τ) \cdot (1-γρτ)$, where $β$ is the decision branching factor, $ρ$ is provenance fidelity, $τ$ is the verification rate, and $γ\in [0,1]$ captures correlated-detection synergy between provenance and verification failures. The model predicts a sharp phase transition between self-healing ($R_{\mathrm{eff}} < 1$) and self-destabilizing ($R_{\mathrm{eff}} > 1$) regimes. We introduce a three-class provenance taxonomy: cryptographic, institutional, and context binding, and derive four falsifiable hypotheses with a proposed 12-week stepped-wedge cluster-randomized trial in government grant review panels. The framework bridges AI alignment theory and institutional design.
comment: 19 pages, 4 figures. Accepted for presentation at AGI-26 (Springer LNAI, forthcoming). v2 corrects the sign of the synergy term in the constitutive law (Eq. 2) and reformulates H3 as a threshold-crossing claim, per peer review
DTBench: A Synthetic Benchmark for Document-to-Table Extraction KDD26
Document-to-table (Doc2Table) extraction derives structured tables from unstructured documents under a target schema, enabling reliable and verifiable SQL-based data analytics. Although large language models (LLMs) have shown promise in flexible information extraction, their ability to produce precisely structured tables remains insufficiently understood, particularly for indirect extraction that requires complex capabilities such as reasoning and conflict resolution. Existing benchmarks neither explicitly distinguish nor comprehensively cover the diverse capabilities required in Doc2Table extraction. We argue that a capability-aware benchmark is essential for systematic evaluation. However, constructing such benchmarks using human-annotated document-table pairs is costly, difficult to scale, and limited in capability coverage. To address this, we adopt a reverse Table2Doc paradigm and design a multi-agent synthesis workflow to generate documents from ground-truth tables. Based on this approach, we present DTBench, a synthetic benchmark that adopts a proposed two-level taxonomy of Doc2Table capabilities, covering 5 major categories and 13 subcategories. We evaluate several mainstream LLMs on DTBench, and demonstrate substantial performance gaps across models, as well as persistent challenges in reasoning, faithfulness, and conflict resolution. DTBench provides a comprehensive testbed for data generation and evaluation, facilitating future research on Doc2Table extraction. The benchmark is publicly available at https://github.com/ZJU-DAILY/DTBench.
comment: KDD26
Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning ICML 2026
Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter-agent interactions. Prior robust MARL methods have primarily considered value-oriented attacks, leaving a gap in robustness when interaction structures themselves are corrupted. In this paper, we propose an interaction-breaking adversarial learning (IBAL) framework that takes an information-theoretic view to construct attacks that impede coordination by perturbing agents' observations and actions, and trains agents to perform reliably under such disruptions. Empirically, our approach improves robustness over existing robust MARL baselines across diverse attack settings and yields stronger performance even under agent-missing scenarios. Our code is available at https://sunwoolee0504.github.io/IBAL.
comment: 9 pages for main, 33 pages for total, Accepted to ICML 2026
Scaling Multi-Agent Environment Co-Design with Diffusion Models
The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system performance. With application domains ranging from warehouse logistics to windfarm management, co-design promises to fundamentally change how we deploy multi-agent systems. However, current co-design methods struggle to scale. They collapse under high-dimensional environment design spaces and suffer from sample inefficiency when addressing moving targets inherent to joint optimisation. We address these challenges by developing Diffusion Co-Design (DiCoDe), a scalable and sample-efficient co-design framework pushing co-design towards practically relevant settings. DiCoDe incorporates two core innovations. First, we introduce Projected Universal Guidance (PUG), a sampling technique that enables DiCoDe to explore a distribution of reward-maximising environments while satisfying hard constraints such as spatial separation between obstacles. Second, we devise a critic distillation mechanism to share knowledge from the reinforcement learning critic, ensuring that the guided diffusion model adapts to evolving agent policies using a dense and up-to-date learning signal. Together, these improvements lead to superior environment-policy pairs when validated on challenging multi-agent environment co-design benchmarks including warehouse automation, multi-agent pathfinding and wind farm optimisation. Our method consistently exceeds the state-of-the-art, achieving, for example, 39% higher rewards in the warehouse setting with 66% fewer simulation samples. This sets a new standard in agent-environment co-design, and is a stepping stone towards reaping the rewards of co-design in real world domains.
DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration
Tackling complex reasoning tasks typically relies on massive monolithic LLMs, which suffer from severe computational redundancy. While task decomposition through structured pipelines or multi-agent collaborations offers an alternative, these approaches inevitably fall into a critical dilemma: predefined static topologies are highly vulnerable to cascading errors, whereas unconstrained dynamic agents suffer from trajectory divergence and unpredictable memory bloat. To address this, we present DynaGraph, a lightweight multi-model framework driven by dynamic topological reconfiguration. At the execution level, DynaGraph multiplexes time-division PEFT adapters over a shared base model, enabling both full system training and inference deployment on a single consumer-grade GPU. At the routing level, the Evaluator continuously monitors execution confidence to trigger hierarchical self-healing: Fine-grained Patching for localized data gaps and Subgraph Reconstruction for severe logical ruptures. Experiments on StrategyQA, MATH, and FinQA demonstrate our 8B model closely approximates the reasoning capabilities of a 72B monolithic model (e.g., 87.6% on StrategyQA, 82.7% on MATH). Furthermore, it reduces latency by up to 68.1% and token consumption by 68.6% compared to unconstrained dynamic architectures.
Provably Convergent Actor-Critic for MARL through Risk-aversion
Learning stationary policies in infinite-horizon general-sum Markov games (MGs) remains a fundamental open problem in Multi-Agent Reinforcement Learning (MARL). While stationary strategies are preferred for their practicality, computing stationary forms of classic game-theoretic equilibria is computationally intractable -- a stark contrast to the comparative ease of solving single-agent RL or zero-sum games. To bridge this gap, we study Risk-averse Quantal response Equilibria (RQE), a solution concept rooted in behavioral game theory that incorporates risk aversion and bounded rationality. We demonstrate that RQE possesses strong regularity conditions that make it uniquely amenable to learning in MGs. We propose a novel single-timescale Actor-Critic algorithm characterized by a faster actor and a slower critic. Leveraging the regularity of RQE, we prove that this approach achieves global convergence with finite-sample guarantees. We empirically validate our algorithm in several environments to demonstrate superior convergence properties compared to risk-neutral baselines.
Consistent Opponent Modeling in Imperfect-Information Games
The goal of agents in multi-agent environments is to maximize total reward against the opposing agents that are encountered. Following a game-theoretic solution concept, such as Nash equilibrium, may obtain a strong performance in some settings; however, such approaches fail to capitalize on historical and observed data from repeated interactions against our opponents. Opponent modeling algorithms integrate machine learning techniques to exploit suboptimal opponents utilizing available data; however, the effectiveness of such approaches in imperfect-information games to date is quite limited. We show that existing opponent modeling approaches fail to satisfy a simple desirable property even against static opponents drawn from a known prior distribution; namely, they do not guarantee that the model approaches the opponent's true strategy even in the limit as the number of game iterations approaches infinity. We develop a new algorithm that is able to achieve this property and runs efficiently by solving a convex minimization problem based on the sequence-form game representation using projected gradient descent. The algorithm is guaranteed to efficiently converge to the opponent's true strategy under standard Bayesian identifiability and visitation assumptions, given observations from gameplay and possibly additional historical data if it is available.
Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms
As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can you use to decide whether to trust an unfamiliar agent in the wild and delegate to it? A natural governance intuition is to extend human identity verification and reputation mechanisms, from ``Know Your Customer'' and credit scores to ``Know Your Agent'' regimes. However, we argue that this analogy is fundamentally incomplete. Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility. Yet language model agents are ontologically \emph{dissociative}: they are essentially an assemblage of mutable modules -- foundation models, system prompts, tool-access policies, external memory, and, in some cases, a multi-agent system as a whole -- any of which may change agent behavior -- with a fluid persona that is also vulnerable to adversarial attack and may not internalize sanctions. Drawing on dissociative identity disorder jurisprudence, this dissociativity leaves agents without grounding for identifiability, predictability, credibility, and rehabilitability -- the very properties that reputation mechanisms aim to sustain -- thereby collapsing trust. We argue that identity-based, ex post, regulative, sanction-based governance, such as reputation, is structurally inapplicable to dissociative agents, and we suggest a shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.
comment: Accepted by FaccT 2026
Symphony-Coord: Adaptive Routing for Multi-Agent LLM Systems
Multi-agent large language model systems can tackle complex multi-step tasks by decomposing work and coordinating specialized behaviors. However, current coordination mechanisms typically rely on statically assigned roles and centralized controllers. As agent pools and task distributions evolve, these design choices can lead to inefficient routing, poor adaptability, and fragile fault recovery. We introduce Symphony-Coord, a task-local coordination framework with decentralized execution that transforms agent selection into an online multi-armed bandit problem. Instead of relying on a fixed task-to-role map, Symphony-Coord allows routing specializations to emerge from interaction and feedback. The framework employs a two-stage dynamic beacon protocol:(i) a lightweight candidate screening mechanism to limit communication and computation overhead; and (ii) an adaptive LinUCB selector that routes subtasks using context features derived from task requirements and agent states, updated through delayed post-execution feedback. Under candidate-conditional linear bandit assumptions, we prove sublinear regret bounds for the immediate-feedback selector and explicitly separate the deferred-update effects introduced by post-vote rewards. Validation through simulation experiments and real-world large language model benchmarks shows that Symphony-Coord improves task routing efficiency and recovery behavior under distribution shifts and agent failures.
comment: 41 pages,15 figures
Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows
LLM agents execute multi-step workflows that mutate external state through tools. Common orchestrators treat tool return as the settlement trigger, so faults, speculation, and concurrent agents can leave partial effects, losing-branch residue, stale writes, or irreversible sends. Correct settlement needs two facts that retries, checkpoint replay, locks, and compensation each conflate: which effects must settle together, and when earlier conflicting work is exhausted. Atomix makes this split explicit with progress-aware transactions. The runtime records reads and effects during execution, seals a transaction when its footprint is complete, and commits only after per-resource frontiers show that no earlier conflicting work can still arrive. Commit is final settlement: Atomix releases bufferable effects, accepts reversible external effects as final, and lets irreversible effects leave the gate. Abort suppresses unreleased effects and compensates externalized reversible effects where possible. On representative agent workloads, this composition improves clean recovery under injected faults, isolates contending and speculative work, and prevents correctly classified irreversible actions from leaking; microbenchmarks show microsecond-scale wrapper overhead relative to tool latency.
Systems and Control (EESS)
Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation
Large language models (LLMs) are increasingly used to automate power-system analysis, but many utilities and energy-research labs require on-premise serving for confidentiality, regulatory, reproducibility, and cost reasons. This makes the reliability of open-weight models a deployment issue. We show that first-pass failures in power-system code generation are dominated not by reasoning alone, but by structured API-knowledge boundary errors: hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries. We introduce PowerCodeBench, an execution-validated benchmark generator that pairs natural-language operator queries with pandapower code and numerical ground truth; an L0-L3 documentation-driven probing procedure that measures per-model API knowledge profiles; and a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction. On a 2,000-task frozen release, we evaluate ten open-weight LLMs (1.5B-480B parameters) and four commercial mid-tier APIs. The intervention improves every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points. Open-weight models in the 70B-120B range match the commercial mid-tier accuracy range, while Llama-3.1-405B and Qwen3-Coder-480B lead the panel. The targeted prompts preserve the full-context accuracy ceiling while using 41% of the prompt-token cost. The result is an accuracy-side, deployment-time path toward reliable on-premise LLM assistance for grid-analysis workflows without fine-tuning or cloud inference.
comment: 43 pages, 12 figures, includes supplementary material
On-Device Robotic Planning: Eliminating Inference Redundancy for Efficient Decision-Making
Reasoning-based robotic policies using large language and vision-language models achieve strong semantic planning capabilities but mostly suffer from a high inference latency that limits practical real-time deployment. In this work, we observe that robotic reasoning workloads contain substantial temporal redundancy, where consecutive observations frequently produce identical actions and subgoals. Based on this insight, we present REIS, a human cognition inspired robotic decision-making framework that minimizes unnecessary reasoning while preserving semantic adaptability. REIS combines lightweight scene gating, KV-steered affordance routing, and deliberative reasoning to accelerate robotic control under embodied constraints. Experiments on ALFRED, and real-world robotic tasks demonstrate that REIS significantly suppresses reasoning overhead while maintaining competitive task performance.
comment: 19 pages
Ladder Logic Translation using Large Language Models in Industrial Automation
Ladder logic translation is an important problem in industrial automation because without it, it is difficult to switch Programmable Logic Controller (PLC) vendors. The prevailing translation problem highlights mismatched programming environments, incompatible ladder logic constructs, limitations in terms of differences in the semantic expressiveness of the vendor formalisms and integrated black-box proprietary engineering tools which are exemplified in our example case; Rockwell to Siemens PLC code translation. This work presents a mathematical formulation of the problem, the detailed architecture of a solution which supports XML extraction, structural normalization, constrained generative function (LLM), and system integration via the TIA Portal Openness API as rigorously engineered pipeline for automated translation of Rockwell Ladder Programs to Siemens S7 ladder programs. Finally, we present results that show that the translations retain high semantic consistency across instruction categories.
Current Practices in Electricy Demand and Charging Scheduling for On-Road Electric Fleet Operations: An Industry-Wide Review
The electrification of on-road fleet logistics promises improved air quality, lower noise emissions, major climate benefits, increased energy flexibility through the use of locally generated electricity and reduced dependence on imported fuels. However, battery electric vehicles can introduce operational planning challenges not present with internal combustion engine vehicles, including heterogeneous charging speeds, exposure to volatile electricity prices, and scarcity in infrastructure. Managing these complexities requires solutions that balance cost efficiency and robustness, supported by sector coupling between transport and electricity systems. This paper reviews the current state of digital systems for operational decision-making in electric fleet management through a grey literature analysis, drawing on practitioner-oriented sources such as industry reports, company documentation, and technical blogs that reflect real-world practices and developments. We identify key trends and gaps, providing insights to guide future research and development.
Model-free LQG Control with Chance Constraints
This paper studies model-free optimal control design and its convergence properties for linear time-invariant systems subject to probabilistic risk or chance constraints. In particular, we study a natural policy gradient (NPG)-based actor-critic (AC) algorithm with two timescales, using a Lagrangian primal-dual framework to enforce the constraint. Furthermore, the risk is defined as the probability that a function of the one-step-ahead state exceeds a user-specified threshold. To our knowledge, this is the first work to study the analytical convergence properties for NPG-based AC in a chance-constrained linear-quadratic Gaussian (LQG) regulator setting without model knowledge. We establish the coercivity and gradient dominance properties of the Lagrangian function, which ensure linear convergence and closed-loop stability during training for the actor. On the other hand, we analyse the convergence properties of the temporal difference (TD(0)) learning for the critic, applying stochastic approximation theory. Also, we demonstrate no duality gap in the constrained optimisation problem. Additionally, we have performed numerical analysis of the convergence properties and accuracy of the proposed method, comparing it with model-based chance-constrained LQR and scenario-based MPC. Results show that our approach effectively limits risk while maintaining near-optimal performance, without requiring full model knowledge or real-time optimisation.
comment: Under review at IEEE OPEN JOURNAL OF CONTROL SYSTEMS
Posterior and Likelihood Sensitivity in Bayesian Distributionally Robust Optimization
We introduce the notion of worst-case posterior and worst-case likelihood sensitivity. These measure, respectively, the sensitivity of the expected cost to worst-case perturbations of the posterior distribution and worst-case perturbations of the likelihood of a Bayesian model. Each defines a quantitative measure of robustness. A decision maker concerned about the sensitivity of the out-of-sample expected cost to deviations from her assumptions will want a decision for which both sensitivities are small. We derive posterior and likelihood sensitivities for uncertainty sets defined in terms of deviation measures. Posterior sensitivity vanishes when the posterior variance shrinks to zero, which occurs when parameter uncertainty is eliminated from learning. Parameter learning does not eliminate likelihood sensitivity. A distributionally robust formulation of a Bayesian optimization problem makes a near-Pareto-optimal tradeoff between performance (expected cost) and robustness (posterior and likelihood sensitivity).
Steering Fractional-Order Network Dynamics via Joint Parameter and State Control
This paper studies the control of discrete-time linear fractional-order networks, a flexible modeling framework for systems with long-range memory such as power grids, biological networks, and neuronal circuits. In contrast to the common view that fractional exponents (time-scales) are fixed parameters, we show that they can be systematically steered, together with the network coupling matrix, by appropriately designed input sequences. We first derive algebraic conditions under which the coupling matrix and the vector of fractional exponents of a given network can be reconfigured to desired values, and we characterize how truncating the infinite-memory term impacts the resulting dynamics. Building on these results, we construct an equivalent linear representation that isolates the contribution of memory, and we introduce a fractional reachability matrix that provides explicit conditions for jointly steering both network parameters and state in a finite number of steps. To address practical implementations, we further formulate an energy-constrained steering problem that incorporates actuator bounds and finite-memory approximations as a quadratic program. The framework is illustrated on low-dimensional toy examples, on larger networks with Erdos-Renyi, Barabasi-Albert, and Watts-Strogatz topologies, and on a brain network model inferred from electrocorticography recordings of an epilepsy patient, where we showcase transitions between pre-seizure and seizure configurations.
Safe Arrival Scheduling at Constraint Waypoints in UAM Corridors
This study introduces a novel Air Traffic Control (ATC) concept to support self-separation between vehicles in Urban Air Mobility (UAM) corridors. Our proposed scheme involves sharing intended arrival schedules at Constrained Waypoints (CWPs) among UAM operators. We propose two approaches to assist the arrival scheduling at CWPs by computing the minimum arrival time gap necessary for each pair of vehicles to ensure their safety throughout the flights within the corridor. The first approach considers the minimum separation distance required by the Near Mid-Air-Collision (NMAC) avoidance rules, while the second one is based on the Responsibility-Sensitive Safety (RSS) rules. We demonstrate that the NMAC-rule-based approach can effectively prevent collisions in normal circumstances, where the vehicles adhere to the speed limits of the corridor. However, this approach does not guarantee safety if vehicles exceed the speed limits. Conversely, while the RSS-rule-based approach ensures collision prevention during emergencies when vehicles exceed speed limits, it may require larger arrival time gaps under normal circumstances, which may lead to reduced traffic flow. Our results are confirmed through numerical simulations.
Trajectory Planning for Non-Communicating Mobile Robots using Inverse Optimal Control
To enable an efficient interaction of non-communicating mobile robots in collision avoidance scenarios, we present a novel combined trajectory planning and prediction algorithm. Inverse optimal control is used to estimate unknown goal states of all robots based on observed past trajectories. Each robot also takes the perspective of other robots in considering self-prediction and solves a joint prediction problem using the estimated goal states. The resulting predictions are then considered for planning. Simulation results of scenarios with 2-8 robots show that the median of the durations until all vehicles reach their goals is 9.8 % faster compared to planning with constant acceleration based estimated goal states. Moreover, the proposed approach never leads to the solver being unable to find a solution to the planning or prediction problem.
Near-Optimal Mixed Strategy for Zero-Sum Linear-Quadratic Differential Games
Deriving analytic solutions for optimal mixed strategies in zero-sum linear-quadratic differential games (ZSLQDGs) remains an open problem. In this paper, we analytically synthesize near-optimal mixed strategies for ZSLQDGs and establish rigorous performance certifications. Specifically, we construct a surrogate pure-strategy stochastic differential game (SDG) by matching the first two moments of the mixed strategies. This method achieves an $\mathcal{O}(\barπ^2)$ weak approximation of state distributions and expected costs with respect to the maximum commitment delay $\barπ$. By analytically resolving the surrogate SDG, we derive closed-form optimal control laws for the matched moments. Crucially, we reveal that the surrogate game is governed by a Generalized Riccati Differential Equation (GRDE), which explicitly dictates a dynamic energy allocation law for variance injection. Building on these solutions, we propose a robust dual-routing architecture to execute the near-optimal mixed strategies. Furthermore, we certify that both the global value approximation error and the strategy suboptimality gaps are bounded by $\mathcal{O}(\barπ^{\frac{1}{2}})$. Finally, numerical experiments on a double-integrator pursuit-evasion game illustrate the induced physical behaviors and validate the theoretical bounds.
A Data-Driven Methodology for Scalable Distributed MPC in Heterogeneous Building Aggregation: From Systematic Feature Selection to Convex Optimization
Coordinating large-scale, heterogeneous building aggregations for demand response (DR) is impeded by a dual challenge: the computational intractability of centralized Model Predictive Control (MPC) and the inadequacy of conventional feature selection methods, which fail to address the error-compounding nature of multi-step forecasting required by MPC. This paper proposes a comprehensive, data-driven framework that first employs a systematic, MPC-aware feature selection methodology to ensure robust multi-step prediction, then models the complex building dynamics using a novel Input-Convex Encoder-Only Transformer (IC-EoT) to guarantee a convex optimization problem, and finally solves the resulting constraint-coupled problem (CCP) in a fully distributed manner using the Tracking Alternating Direction Method of Multipliers (ADMM) algorithm. The framework is validated in a high-fidelity co-simulation environment, controlling a heterogeneous aggregation of consumer and prosumer buildings based on the EnergyPlus under a dynamic time-of-use (TOU) tariff. Results demonstrate that the proposed distributed approach achieves near-identical economic optimality and superior thermal comfort compared to a theoretical centralized controller, while exhibiting exceptional computational scalability that overcomes the real-time infeasibility of the centralized approach for large aggregations.
comment: 13 pages, 4 figures, 3 tables
Geometry-Aware Control Barrier Functions for Collision Avoidance via Bernstein Polynomial Approximations ICRA 2026
Safe navigation often relies on well-defined conditions based on the shape of robots and obstacles, and can be challenging when they have irregular geometries. While Control Barrier Functions (CBFs) offer an efficient mechanism to enforce safe set forward invariance, common shape surrogates (e.g., spheres or super-ellipsoids) either are overly conservative in unstructured scenes or require many local primitives, which inflates constraint counts and degrades real-time performance. In this paper, we introduce a novel geometry-aware Control Barrier Function (CBF) based on Bernstein-Polynomial Signed Distance Fields (BP-SDFs). It provides a unified way to represent the obstacles and robots, so as to represent the barrier function with a unified minimum distance. Benefiting from the differentiability of the Bernstein polynomials, one can easily enforce the control constraints in a closed loop. We validate the method's efficiency and performance to guarantee safety in single-robot navigation and heterogeneous multi-robot collision avoidance via simulations under different environments.
comment: 8 pages; Accepted by 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
Bounds on Prediction Error When Using an Impulse Response/Equilibrium Model Structure
An impulse response/equilibrium model (IREM) structure combines a linear convolution model with a nonlinear function that sets the current operating point via an equilibrium variable with integrator dynamics. This model structure is well suited for mildly nonlinear systems and in particular has been applied to battery fast charging control. This paper provides observability conditions for the IREM model structure and bounds on the prediction error. These conditions can be evaluated directly on the system impulse response.
Verifying global identifiability of parametric linear ODE models is NP-hard
Global parameter identifiability is a property of a parametric ODE model to recover the parameter values uniquely from the input-output data. Not all parametric ODE models have this property, and checking for parameter identifiability is a prerequisite to perform numerical parameter estimation. There are many algorithms and software packages for global parameter identifiability, and frequently the runtime is large. However, the computational complexity for this problem has not been analyzed yet, though there are complexity results for local (finitely many values fit the data) parameter identifiability. In this paper, we estimate the complexity of checking global parameter identifiability over real fields for ODE models that depend linearly on the state variables and rationally on the parameters. In particular, we prove that it is equivalent to the injectivity problem.
SoFiE: Soft Finger Exoskeleton for Intelligent Grasping
Soft wearable robotic systems have emerged as a promising solution for assisting individuals with reduced hand function. This paper presents SoFiE, a modular soft finger exoskeleton designed to assist index-finger flexion during grasping tasks. The proposed system is primarily fabricated using 3D-printed flexible materials, enabling a lightweight, low-profile, and modular design. Actuation is achieved through a tendon-driven mechanism powered by a compact DC motor, while passive extension is provided by a compliant conductive spring. This element, termed StretchSense, also functions as a proprioceptive sensor by exhibiting resistance changes under deformation. Furthermore, a novel tactile sensing approach, MagSense, is introduced, using a magnet and magnetometer pair embedded in a soft fingertip structure to estimate contact force and object compliance. The system is fully untethered and controlled by an embedded microcontroller. In addition, actuator-level sensing through motor encoder feedback enables estimation of the system state, providing a foundation for safe and adaptive control strategies. Experimental validation demonstrates the capability of the system to provide reliable pose estimation, distinguish between materials with different stiffness, and generate distinct sensor signatures across different grasping tasks. This paper details the design, fabrication, and sensing concepts of the proposed exoskeleton as a proof of concept toward modular, soft, and assistive wearable robotics.
Valorisation of Fermentation Side-Stream for Waste-to-Mycoprotein: Nutrient Composition, Metabolic Insights and Process Optimisation
Fermentation-derived side streams represent an underutilised resource for sustainable protein production. This study investigates the potential of centrate from industrial Fusarium venenatum fermentation as a nutrient source for fungal biomass generation. Following compositional characterisation, a synthetic centrate medium was formulated and evaluated using a Box-Behnken design combined with response surface methodology. Across 46 experimental runs, cell dry weight (CDW) ranged from 0.22 to 3.87 g per liter, demonstrating a strong dependence on nutrient composition. Ammonia and glucose were identified as the dominant factors influencing biomass production, with significant nonlinear effects. The model predicted a maximum CDW of 4.17 g per liter under optimised conditions, which was experimentally validated at 3.99 g per liter. Carbon conversion efficiency reached up to 29.02%, indicating effective substrate utilisation. These findings demonstrate that fermentation-derived centrate can support substantial fungal growth, while highlighting its potential to enhance nutrient recovery and influence the biochemical composition of sustainable mycoprotein.
Clustering-enhanced adaptive Benders decomposition for energy systems planning optimization
High-resolution energy system capacity expansion models (CEMs) for energy transition planning often result in large-scale mixed-integer linear programming (MILP) formulations. Benders decomposition (BD) offers a scalable solution approach by iteratively solving a master problem (MP) for investment decisions and multiple subproblems (SPs) for operational decisions. However, accumulated Benders cuts generated by the SPs can make MP solution a major computational bottleneck. Incomplete SP parallelization can also introduce further bottlenecks when SPs exceed available CPUs. We develop clustering-enhanced BD methods to address these challenges, by using clustering to group similar SPs for: a) aggregated Benders cut construction and b) identification of representative SPs to be solved most frequently. For grouped-cuts, we examine two adaptive formulations based on dual variables and a fixed-grouping formulation based on exogenous time-series inputs. We evaluate these methods in an electricity-sector CEM across varying system sizes, temporal SP lengths, inter-SP coupling strengths represented by CO2 policy, computational resources, and stochastic settings. Relative to a benchmark regularized multi-cut formulation, adaptive grouped cuts outperform fixed grouping and provide substantial benefits under weak inter-temporal coupling. The largest gains occur in larger systems with shorter SP horizons, where the MP accounts for a greater share of runtime. Their effectiveness declines under strong inter-temporal coupling, such as annual CO2 emissions limits, where the benchmark multi-cut performs best. The representative-SP method outperforms the benchmark under limited parallelization when SP solution dominates runtime. Overall, the preferred BD strategy depends on inter-SP coupling strength and whether computational burden lies in the MP or the SPs.
Behavior Cloning of MPC for 3-DOF Robotic Manipulators ICRA 2026
While Model Predictive Control (MPC) provides strong stability and robustness, it imposes a significant computational burden on real-time systems. This paper investigates the application of Behavior Cloning to approximate MPC policies for the real-time control of a 3-degree-of-freedom robotic manipulator. We present a baseline controller combining Inverse Kinematics with MPC and evaluate neural network architectures, ranging from classical regression algorithms to deep learning models including Deep MLPs and RNNs, to derive computationally efficient surrogate policies. We analyze generalization capabilities, stability considerations, and the trade-offs inherent in different architectural choices. Our empirical study employs both online and offline evaluations to assess performance regarding accuracy, computational efficiency, and fidelity to the original MPC policy. Our results demonstrate that Behavior Cloning can effectively reduce the computational burden of MPC policies for 3-DOF robotic manipulators, achieving a 3x reduction in inference latency with a 84.98% success rate under relaxed tolerances. Notably, we find that static architectures outperform temporal variants, confirming the sufficiency of instantaneous state observations for this task. However, we observe a precision gap under strict tolerances, which suggest that while Behavior Cloning captures the global optimal trajectory, further research is needed to minimize terminal steady-state error.
comment: Accepted at the IEEE ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning (RL4IL), 6 pages excluding references
Benchmarking Recursive-Collapse Warning Claims Under Matched False-Positive Control
Recursive systems can enter collapse-like regimes -- self-reinforcing amplification, persistent recursion, and narrowing diversity that mask accelerating internal degradation -- before overt failure becomes visible. We introduce Loopzero, a claim-bounded benchmark framework for testing whether recursive failures follow a directional telemetry pattern: rising gain (G), recursive persistence (p), and declining diversity ($δ$). The claim boundary is specified in Lean; the Lean artifact does not verify real telemetry, benchmark validity, or detector performance. We evaluate the bridge on two frozen public-artifact benchmarks: a segmented public-markets benchmark (Volmageddon 2018, COVID MWCB 2020) and a MovieLens-25M offline deterministic recommender replay. Detectors are evaluated under a locked equal-false-positive contract (FP $\in$ [0.03, 0.07], pre-registered) so all configurations face the same alert budget. Neither tested standard comparators nor Loopzero's pre-registered quantile detector achieved an accepted operating point. Directional witness alignment held on both canonical benchmarks, with adjacent-horizon and row-level limitations disclosed. Digitized Shumailov et al. (2024) LLM training-loop trajectories are directionally consistent with the pattern; matched-FP evaluation in that domain is deferred. The contribution is a reproducible, falsifiable benchmark framework for evaluating recursive-collapse warning claims under an explicit alert-budget contract -- non-acceptance reported as a first-class scientific outcome.
comment: 29 pages, 7 figures, 2 tables; supplementary materials: 9 pages, 1 figure, 4 tables. Code, derived data packets, and Lean artifact: https://github.com/davidmullett/loopzero-paper-public (release tag lean-v1.0)
Generalized Model Predictive Path Integral Control as Expectation--Maximization
Model Predictive Path Integral (MPPI) control is a powerful sampling-based method for solving stochastic optimal control problems and has enabled real-time control in complex robotic systems. Despite its empirical success, its theoretical understanding remains limited. In this work, we show that MPPI can be interpreted as a special case of the Expectation-Maximization (EM) algorithm applied to a probabilistic inference formulation of optimal control. This perspective leads to a generalized EM-MPPI framework that extends MPPI beyond the commonly used Gaussian parameterization. We analyze the convergence behavior of this algorithm and characterize the local convergence rate in terms of the covariance of the posterior trajectory distribution and the exploration distribution. For exponential-family distributions, we establish a sufficient increase property of the log-likelihood when the log-partition function is strongly convex. Specializing the analysis to Gaussian MPPI yields explicit global and local convergence characterizations. The code for the experiments will be available upon acceptance.
Symmetric Hermite quadrature-based balanced truncation for learning linear dynamical systems from derivative data
Data-driven reduced-order modeling is an essential component in the computer-aided design of control systems. In this work, we present a novel symmetric Hermite formulation of the quadrature-based balanced truncation algorithm that constructs linear reduced-order models from evaluations of the full-order system's transfer function and its derivative. Significantly, the Hermite formulation preserves desirable qualitative properties of the system used to generate the data, such as state-space Hermiticity and, consequently, asymptotic stability.
comment: 14 pages, 2 figures, 4 tables
Predicted-Flow Control Barrier Functions for Real-Time Safe Optimal Control
Control barrier functions (CBFs) provide real-time safety guarantees through pointwise conditions on the state. However, synthesizing a valid CBF is difficult and the resulting controllers are myopic. To address myopia, this article introduces predicted-flow control barrier functions (P-CBFs), which generalize the CBF from a function of the current state to a functional of a predicted flow under a parametrized control plan over a finite prediction horizon. For safety, a P-CBF can certify that the predicted flow is in a safe set over the entire prediction horizon. However, candidate P-CBFs suffer from the same challenge as candidate CBFs, namely, control constraints make it difficult to guarantee that the P-CBF is valid. This article resolves this challenge by introducing a terminal candidate P-CBF requiring that the predicted flow end in a backup safe set at the terminal time, and a planning-time shift that modulates the prediction horizon, providing an additional degree of freedom to ensure feasibility. The real-time control and the evolution of the control-plan parameter and planning-time shift are determined jointly by a single convex optimization that is guaranteed to be feasible and renders the associated safe set forward invariant. The resulting safe optimal flow control provides a safety certificate over the entire prediction horizon and unifies finite-horizon integral-cost optimization with safety certification. This optimization reduces to a quadratic program (QP) if the control constraints are a convex polytope. The QP implementation, termed FlowBarrier, is validated on a nonholonomic ground robot navigating a dense environment. FlowBarrier is compared to nonlinear model predictive control and two CBF-based safety filter methods across 100 trials, where FlowBarrier achieves the highest goal-reaching rate, zero safety violations, and the lowest computation time.
Social learning community detection with nonlinear interaction
Conventional community detection requires centralized network data, making it unsuitable for distributed or privacy-preserving systems. In this paper, we demonstrate that macroscopic graph partitioning can emerge purely from strictly local, privacy preserving interactions driven by social learning. By reframing clustering as a symmetry-breaking process within nonlinear opinion dynamics, we show that exchanging saturated state dependent signal (like public actions) forces a network to naturally fracture along its sparsest cuts. We mathematically establish the spectral conditions under which dense core communities lock into stable, polarized states, robustly resisting external influence. To apply this mechanism, we propose three decentralized algorithms, leading up to the Score-based Edge Reliability (SER) framework. By evaluating network ties across multiple independent discussion topics, SER statistically bypasses the errors of traditional greedy bisections and naturally isolates structurally ambiguous frontier nodes. Validations on the ABCD benchmark and the real-world Ngogo chimpanzee network confirm that our fully decentralized approach matches the accuracy of globally optimized heuristics (e.g., Louvain, Leiden) up to a theoretical limit of detectable graphs.
Symmetry-Protected Quantum Computing using Metamaterials
We propose a new architecture for practical quantum computing that combines three established principles: symmetry protection of relative-motion qubits via the generalized Kohn theorem, control via twisted-light orbital angular momentum, and metamaterial nanofocusing (e.g. using Weyl-semimetal plasmonics). Crucially, the core mechanism is generic: it applies to any current or future quantum computing system involving parabolic confinement, including cold atoms, ions, and semiconductor dots.
comment: Invited talk at META (International conference on Metamaterials, Photonic Crystals and Plasmonics), Dublin, July 14-17, 2026
Contraction Analysis of Time-Delay Systems
In this paper, we investigate contraction analysis for nonlinear time-delay systems described by functional differential equations. We first extend the concept of Lyapunov-Krasovskii functionals within the differential framework. We then show that its existence is equivalent to that of an incremental Lyapunov-Krasovskii functional and guarantees uniform incremental exponential stability. Next, we extend the concept of Lyapunov-Razumikhin functions within the differential framework, whose existence also ensures uniform incremental exponential stability. As an application of our results, we formulate stabilizing feedback control design for nonlinear time-delay systems with single delays in terms of linear matrix inequalities.
Feedback Control of a Recirculating Bioreactor with Electrophoretic Removal of Inhibitory Extracellular DNA
Extracellular DNA accumulation in recirculating bioprocesses inhibits microbial growth and reduces productivity. We consider a continuous bioreactor with a recirculating loop and an electrophoretic filtration unit for selective DNA removal, and develop a feedback control framework combining online state and parameter estimation via an Unscented Kalman Filter with an adaptive Model Predictive Controller that jointly optimizes dilution rate and filtration activation. Closed-loop simulations under nominal and perturbed conditions show that the addition of the filtration unit enables the proposed control strategy to achieve significantly higher cumulative profit while keeping DNA concentration below the inhibition threshold.
Saturation-aware robust optimal operation control of microgrids based on minimum-regret optimization
This paper studies robust optimal operation control problems for microgrids with a high share of renewable energy sources. The main goal is to ensure an optimal operation in the presence of a wide range of scenarios of uncertain infeed of renewable sources and uncertain load demand. We formally state a minimum-regret robust model predictive control (MPC) problem and address it by making effective use of a hierarchical microgrid control structure. In detail, we consider an enhanced primary control layer composed of droop control and an autonomous limitation of power and energy. We prove that this enables us to use constant power setpoints to achieve an optimal operation under certain conditions. To obtain a tractable controller, we then combine the abovementioned constant saturation-aware setpoints with an energy management system, which solves a robust unit commitment problem within a model predictive control framework. In a case study, we finally demonstrate the viability of the control design.
Quantum Hardware-in-the-Loop for Optimal Power Flow in Renewable-Integrated Power Systems
Quantum computing has emerged as a promising computational paradigm to address unresolved challenges in the modeling and control of modern power systems. However, most existing studies focus on offline simulations, and a practical framework for validating quantum algorithms in real-time operational environments remains lacking. This study proposes a quantum hardware-in-the-loop framework that integrates a real-time digital simulator with quantum and quantum-inspired hardware to solve combinatorial power flow and optimal power flow formulations under dynamic operating conditions. The proposed framework is validated using the IEEE 9-bus test system and a modified version with integrated solar and wind farms. The results confirm successful integration and convergence within a predefined tolerance. The study also identifies key limitations and challenges, such as limited access to quantum and digital annealers and current scalability limitations, that must be considered in future developments. Nevertheless, the results highlight the potential of quantum computing to significantly enhance the modeling and control of future power systems with high penetration of renewable energy sources.
comment: 11 pages, 6 figures, 6 tables
Grid Capacity Expansion under Data Centers and Electrified Manufacturing Large Loads
In this paper, we consider the expansion of power grids under emerging large loads from data centers and electrified manufacturing. We develop a multi-period grid capacity expansion model to determine optimal investment profiles for power generation, storage, and transmission capacity while accounting for hourly power dispatch, such that electricity demand is satisfied and the total planning and operation cost is minimized. We also propose a new modeling approach regarding the spatial distribution of demand from large loads. The model is used to analyze the expansion of a synthetic grid that follows key characteristics of the ERCOT system over a seven-year planning horizon, under loads from data centers and electrified oil refining, which account for 17.5% and 4.7% of total annual electricity demand by the end of the planning horizon. The optimal investment policy leads to an 83.6% increase in generation capacity and exploits the short construction times of solar and storage as well as the operational flexibility of thermal generators. Finally, sensitivity analysis reveals that the construction time of grid assets substantially impacts investment timing, generation technology mix, and transmission capacity expansion. The proposed modeling framework is general and can be extended to other grid systems, enabling the exploration of diverse demand scenarios, policy assumptions, and regional characteristics.
Hybrid Energy-Aware Reward Shaping: A Unified Lightweight Physics-Guided Methodology for Policy Optimization
Deep reinforcement learning for continuous control often suffers from high variance, low energy efficiency, and poor generalization under distribution shift, as purely data-driven exploration ignores available physical structure. This paper proposes Hybrid Energy-Aware Reward Shaping (H-EARS), which encodes dominant energy terms -- assumed known a priori -- directly as reward potentials at O(n) per-step computation. H-EARS decomposes the shaping potential into task-oriented and energy-based components, supplemented by an action regularization term that deliberately modifies the optimization objective to enforce energy-efficient control. A complete theoretical foundation is established: functional independence of shaping and regularization, energy-based gradient enrichment under positive-definite Hessian conditions, convergence guarantees under function approximation, and approximate potential error bounds. Across four continuous control benchmarks and four baseline algorithms, H-EARS achieves consistent gains in convergence speed, policy stability, and final performance. High-fidelity vehicle simulations validate applicability in safety-critical settings under extreme road conditions.
comment: 23 pages, 48 figures. Accepted by Neurocomputing
OCO-S$^2$: Online Convex Optimization with Stateful Costs and Sparse Communication
We study \textsc{OCO-S$^2$}, an online convex optimization setting in which decisions drive a stable dynamical state, losses are incurred along the induced state trajectory, and first-order feedback is available only through sparse block communication with partial participation. This coupling creates a dynamic-regret problem beyond pointwise OCO: the learner updates and holds decisions at the block scale, whereas the hindsight comparator may vary at the per-round scale. We propose \textsc{OCO-S$^2$-OGD}, a projected block online gradient method that updates deployed decisions using sparse block-level distributed feedback. We prove dynamic-regret bounds for the incurred trajectory cost, quantifying the tradeoff among block communication, comparator variation, state-memory truncation, and partial participation. We further introduce a prediction-augmented variant, \textsc{OCO-S$^2$-OGD-P}, and show that accurate block-level predictions improve the optimization term in the regret bound through their realized gradient-mismatch error. Overall, this work provides a regret-theoretic foundation for communication-efficient online decision-making in systems where algorithmic updates and physical state trajectories are intrinsically coupled.
comment: Revised version. Reformulated under the OCO-S^2 framework; presentation and proofs improved; main results unchanged
Safety-Critical Adaptive Impedance Control via Nonsmooth Control Barrier Functions under State and Input Constraints
Safe physical interaction is critical for deploying robotic manipulators in human-robot interaction and contact-rich tasks, where uncertainty, external forces, and actuator limitations can compromise both performance and safety. We propose an online adaptive impedance control framework that enforces joint-state safety while achieving compliant interaction under uncertain dynamics. The approach combines a quadratic-program-based safety filter with a novel composed position-velocity non-smooth control barrier function (NCBF), enabling joint position and velocity constraints to be enforced through a unified relative-degree-one barrier. Unknown dynamics are compensated online using an interval type-2 fuzzy logic system, while actuator torque limits are handled through soft constraints with exact penalty recovery of feasible solutions. A disturbance-observer-enhanced safety mechanism improves robustness against modelling errors and external interaction forces. Using composite Lyapunov analysis, we prove forward invariance of the safe set and the uniform ultimately boundedness of the impedance-tracking error. Simulations on a 7-DOF manipulator with severe parametric uncertainty and external interaction wrenches demonstrate safe constraint satisfaction and robust impedance tracking.
comment: 12 pages, 3 figures
Transferring the driveshaft inertia to the grid via the DC-link in MV drive systems
This paper investigates a control approach that renders the driveshaft inertia completely available on the grid side and enhances the fault ride-through behavior of medium-voltage (MV) drive systems. Two main contributions are presented. First, we show how the rotational inertia of the driveline shaft can be synchronously coupled to the grid through a modification of the speed control reference signal and through an adapted DC-link control strategy. For the latter, we pursue two alternatives: one based on conventional cascaded control and another based on synchronous machine (SM) model matching. Second, we demonstrate that both the standard phase-locked loop (PLL) and the matching control approach can be interpreted, via the ray-circle complementarity, as feedback optimization schemes with distinct steady-state maps. This perspective allows us to revisit matching control, reveal its embedded PLL, highlight its current-limiting and tracking capabilities, and provide an extensive simulation study.
comment: Submitted for review to IEEE Transactions on Control Systems Technology, updated to 22 pages
From Forecast to Action: A Deep Learning Model for Predicting Power Outages During Tropical Cyclones
Power outages caused by tropical cyclones (TCs) pose serious risks to electric power systems and the communities they serve. Accurate, high-resolution outage forecasting is essential for enabling both proactive mitigation planning and real-time emergency response. This study introduces the SpatioTemporal Outage ForeCAST (STO-CAST) model, a deep learning framework developed for real-time, regional-scale outage prediction during TC events with high-resolution outputs in both space and time. STO-CAST integrates static environmental and infrastructure attributes with dynamic meteorological and outage sequences using gated recurrent units (GRUs) and fully connected layers, and is trained via a Leave-One-Storm-Out (LOSO) cross-validation strategy along with holdout grid experiments to demonstrate its preliminary generalization capability to unseen storms and grids. The model produces hourly outage forecasts at a 4 km * 4 km resolution and supports dual forecasting modes: short-term nowcasting with a 6-hour lead time via assimilation of real-time observations, and long-term forecasting with a 60-hour lead time based on evolving meteorological projections. A case study on Typhoon Muifa (2022) demonstrates STO-CAST's operational effectiveness, including error decomposition across model design, meteorological uncertainty, and observation gaps, while highlighting the value of real-time data assimilation and the model's capacity to identify evolving outage hotspots. STO-CAST offers a scalable, data-driven solution to support risk-informed emergency response and enhance power system resilience under intensifying TC threats.
comment: Preprint version. The final published article is available in Risk Analysis; accepted on 19 May 2026
Robust Synchronous Reference Frame Phase-Looked Loop (PLL) with Feed-Forward Frequency Estimation
Synchronous reference frame phase-locked loop (SRF-PLL) techniques are widely used for interfacing and control applications in the power systems and energy conversion at large. Since a PLL system synchronizes its output with an exogenous harmonic signal, often 3-phases voltage or current, the locking of the frequency and phase angle depends on the performance of the feedback loop with at least two integrator terms, and on the distortions of the measured input quantities. For the conventional SRF-PLL with a proportional-integral (PI) control in feedback, we are providing a robust design which maximizes the phase margin and uses the normalization scheme for yielding the loop insensitive to the input amplitude variations. The main improvement in the transient behavior and also in tracking of frequency ramps is achieved by using the robust feed-forward frequency estimator, which is model-free and suitable for the noisy and time-varying harmonic signals. The proposed feed-forward-feedback SRF-PLL scheme is experimentally evaluated on the 3-phases harmonic currents from a standard PMSM drive with the varying angular speeds and loads. Both, the tracked angular frequency and locked phase angle are assessed as performance indicators of the proposed SRF-PLL with feedforwarding.
comment: 8 pages, 10 figures, 2 tables
Contract-based hierarchical control using predictive feasibility value functions
Today's control systems are often characterized by modularity and safety requirements to handle complexity, resulting in the use of hierarchical control structures. Although hierarchical model predictive control offers favorable properties, achieving a provably safe, yet modular design remains a challenge. This paper introduces a contract-based hierarchical control strategy to improve the performance of control systems facing challenges related to model inconsistency and independent controller design across hierarchies. We consider a setup where a higher-level controller generates references that affect the constraints of a lower-level controller, which is based on a soft-constrained MPC formulation. The optimal slack variables of the lower-level MPC serve as the basis for a contract that allows the higher-level controller to assess the feasibility of the reference trajectory without exact knowledge of the model, constraints, and cost of the lower-level controller. To ensure computational efficiency while maintaining model confidentiality, we propose using an explicit function approximation, such as a neural network, to represent the cost of optimal slack values. The approach is tested for a hierarchical control setup consisting of a planner and a motion controller as commonly found in autonomous driving.
Convergence Analysis of Natural Power Method and Its Applications to Control
This paper analyzes the discrete-time natural power method, demonstrating its convergence to the dominant $r$-dimensional subspace corresponding to the $r$ eigenvalues with the largest absolute values. This contrasts with the Oja flow, which targets eigenvalues with the largest real parts. We leverage this property to develop methods for model order reduction and low-rank controller synthesis for discrete-time LTI systems, proving preservation of key system properties. We also extend the low-rank control framework to slowly-varying LTV systems, showing its utility for tracking time-varying dominant subspaces.
comment: 6 pages. will be presented at IFAC World Congress 2026
Singularity-free dynamical invariants-based quantum control
State preparation is a cornerstone of quantum technologies, underpinning applications in computation, communication, and sensing. Its importance becomes even more pronounced in non-Markovian open quantum systems, where environmental memory and model uncertainties pose significant challenges to achieving high-fidelity control. Invariant-based inverse engineering provides a principled framework for synthesizing analytic control fields, yet existing parameterizations often lead to experimentally infeasible, singular pulses and are limited to simplified noise models such as those of Lindblad form. Here, we introduce a generalized invariant-based protocol for finite-dimensional state preparation under arbitrary noise conditions. We transform the finite-dimensional control problem into the equivalent problem for a single-qubit, by restricting the dynamics to a designed SU(2) subspace. The control protocol then proceeds in two-stages: first, we construct a family of bounded pulses that achieve perfect state preparation in a closed system; second, we identify the optimal member of this family that minimizes the effect of noise. The framework accommodates both (i) characterized noise, enabling noise-aware control synthesis, and (ii) uncharacterized noise, where a noise-agnostic variant preserves robustness without requiring a master-equation description. Numerical simulations demonstrate high-fidelity state preparation across diverse targets while producing smooth, hardware-feasible control fields. This singularity-free framework extends invariant-based control to realistic open-system regimes, providing a versatile route toward robust quantum state engineering on NISQ hardware and other platforms exhibiting non-Markovian dynamics.
From Discrete to Continuous Highest-earning Imitation Dynamics
Imitating the highest earners is a common decision-making heuristic, but in finite populations it can generate persistent fluctuations between strategies. This paper studies whether such fluctuations persist as population size grows in heterogeneous two-strategy populations. We show that the Markov chains describing the discrete imitation dynamics form generalized stochastic approximation processes for a good upper semicontinuous differential inclusion, which defines the associated mean dynamics. We prove that these mean dynamics always converge to equilibria. Using stochastic approximation results, we then show that the amplitudes of fluctuations in the population proportions of the two strategies vanish almost surely as population size tends to infinity. Thus, in well-mixed large populations, highest-earning imitation is unlikely to produce large-scale perpetual fluctuations.
comment: a more general case
Location-Invariant Assessment of Flexibility Potential under Distribution System Reconfiguration
The growing integration of renewable and decentralized generation increases the need for flexibility in distribution systems. This flexibility, typically represented in a PQ capability curve, is constrained by network limits and topology. Distribution system reconfiguration (DSR) introduces additional degrees of freedom through switching actions. This paper proposes an AC-constrained methodology to assess flexibility under network reconfiguration, explicitly considering radial operation. The impact of topology changes on PQ capability curves, which serve as a measure of flexibility potential, is analyzed. To that end, a novel measure called location-invariant flexibility potential (LI-FP) is introduced. Results show that reconfiguration can significantly influence and improve operational flexibility. The approach presented enables transparency for system operators, facilitating improved coordination of flexibility providers.
Experiments in Agentic AI for Science
This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering-such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls-we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).
Robotics
DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation
Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
comment: Project website: https://dynaflip-robotics.github.io
Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field CVPR 2026
We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network-based uncertainty-aware 3DGS rasterizer, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation. Extensive experiments across diverse environments demonstrate that GAVIS consistently and significantly outperforms prior approaches in both accuracy and efficiency. Moreover, beyond standalone use, our method can be applied post-hoc to improve the performance of existing approaches.
comment: Accepted to CVPR 2026. Project page https://gatech-rl2.github.io/GAVIS/
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving
The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bi-manual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curated 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly-based reasoning. We benchmark popular robot policies, pre-trained VLAs, and oracle-state planners. Our results reveal a significant performance gap: while pre-trained VLAs exhibit preliminary success on seed tasks after single-task fine-tuning, they struggle to perform on mutated tasks, implying their brittleness in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments. Project page is available at https://umass-embodied-agi.github.io/RoboWits.
comment: The first two authors contributed equally
A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms
Simulation-based RL for contemporary robot control is increasingly organized around GPU-resident simulation: physics, rollout collection, and learning are placed on a single GPU-centric execution path. This paradigm has greatly improved training speed, but it has also encouraged a default assumption that efficient training requires physics to reside on the GPU. We revisit this assumption. Our view is that, in simulation-dominated robot control, the essential question is not which processor runs physics, but whether simulation throughput, policy learning, and runtime synchronization form an efficient end-to-end loop. We present UniLab, a heterogeneous CPU-simulation / GPU-learning architecture that decouples CPU-parallel simulation from GPU policy updates through a unified runtime for data movement, buffering, and synchronization. UniLab is implemented as a complete and extensible training system using MuJoCoUni and MotrixSim CPU-batched physics backends, supporting PPO, SAC, FlashSAC, TD3, and APPO. On representative simulation-based robot control tasks, UniLab improves end-to-end training efficiency by 3--10$\times$ under the same hardware configuration, while reducing dependence on the NVIDIA CUDA-based software stack and supporting cross-platform execution on the Apple macOS platform and the AMD ROCm and Intel XPU accelerator backends. These results show that GPU simulation is an effective path to efficient training, but not a necessary one, broadening the practical system choices available for robot RL training. Project page: https://github.com/unilabsim/UniLab.
Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation
Vision-Language-Action (VLA) models have recently shown strong potential for robot learning by following language instructions. However, in practice, language alone is often insufficient to precisely convey human intent. It is difficult to describe which exact object to interact with among similar candidates, where to act on the object, or how the target may change during execution. To address this limitation, we propose Gaze2Act, a novel VLA framework that leverages human gaze as a dynamic and intuitive intent signal for complex interactive manipulation. Gaze2Act first bridges the ego-exo view gap by mapping first-person gaze into the robot's perspective through cross-view semantic matching, producing both an object mask and a gaze point for coarse-to-fine target specification. These cues are then integrated into the policy through perception-level prompting and action-level conditioning, allowing the robot to attend to relevant regions and execute precise interactions under dynamic intent. In a systematic evaluation across seven task categories and 16 real-robot tasks on a Unitree G1 humanoid, Gaze2Act achieves state-of-the-art performance in both intent accuracy and task success rate. It notably outperforms baselines in object disambiguation, fine-grained interaction, and dynamic intent steering. These results demonstrate that human gaze provides a natural, low-burden, and highly expressive modality for human-in-the-loop VLA control.
comment: Project page: https://zuo-kuangji.github.io/Gaze2Act/
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.
comment: 34 pages
BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models
Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. However, high-dimensional dexterous exploration often triggers temporal inconsistency, sample inefficiency and hardware risks in the real world. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. During the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism to mitigate real-world execution errors and further correct the offline-learned intents within the actual physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA effectively corrects execution discrepancies and adapts to real-world physical variances while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.
comment: 24 pages,11 figures
Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance ICML2026
Recent advances in reinforcement learning (RL) have achieved great successes by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on the sampling-based policy optimization. This design enables better exploration capability of the diffusion model, particularly at the beginning of training, but suffer from low exploitation in Q-value information, resulting in a slow policy convergence. Another branch pays attention to gradient-based policy optimization, which sufficiently exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address this issue, we propose CGPO, \textbf{C}ritic-\textbf{G}uided diffusion \textbf{P}olicy \textbf{O}ptimization, which effectively balances exploration and exploitation with the training-free guidance technique integrated into the denoising process of diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression objectives. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance with better balance between the exploration-exploitation tradeoff. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, and CGPO achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first success to incorporate diffusion policy into real-world RL, with its superior performance on Franka robot arm grasping tasks. Our official page is released at https://dingsht.tech/cgpo-webpage.
comment: accepted by ICML2026
Replicable Simulation-Based Robot Validation through Provenance
Robot behavior is often validated through simulation-based testing, yet the replicability of such campaigns depends critically on transparent documentation of how tests are configured, executed, and post-processed. We argue that data provenance, coupled with the FAIR principles (findability, accessibility, interoperability, and reusability), addresses this gap by explicitly tracking links between artifacts and by attaching machine-readable metadata about file origins and key design decisions. Moreover, provenance and metadata cannot be treated as an afterthought confined to final datasets; they must be integrated into the testing processes that generate those datasets so that evidence can be reconstructed end-to-end. We demonstrate this by augmenting an existing simulation-based testing framework with provenance tracking and metadata collection mechanisms, and by using these extensions to enrich a mobile robot navigation dataset with structured provenance and FAIR-aligned metadata. Finally, we discuss obstacles encountered in this integration -- such as vocabulary alignment, attribute selection, and adoption of domain standards -- and provide actionable recommendations for implementing provenance-centric, FAIR metadata in robotics validation workflows.
Fisher-Preserving Guidance: Training-Free Manifold Constraints for Safe Diffusion Control ICML2026
Diffusion models are effective for waypoint prediction in visual navigation, but standard sampling and test time guidance can produce unreliable or inefficient trajectories when updates drift off the training manifold. We propose Fisher Preserving Guidance with Outer Product Span Projection, a training-free inference method that avoids large Fisher drift associated with off-distribution actions while optimizing a task objective. Our method computes the Fisher-preserving update via a low-rank Jacobian factorization, requiring only a single backward pass per step and enabling real-time use. We further introduce Truncated Fisher Denoising Sensitivity as an uncertainty signal and use it for robust multi-sample action blending. Experiments on toy and realistic navigation benchmarks, including Maze2D with TSDF-based guidance, PushT with official Diffusion Policy weights, and visual navigation in simulation and on real robots, demonstrate consistent improvements in performance over strong diffusion-policy baselines without additional training.
comment: ICML2026
DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding
Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at https://icr-lab.github.io/DGSG-Mind
comment: 9 pages, 6 figures
LLM-Guided Future Hypotheses for Horizon-Aware Exploration in Multi-Step Robot Manipulation
Multi-step robot manipulation requires acting under uncertainty about how the scene will evolve, making exploration and policy adaptation challenging. We study whether short-horizon, task-consistent future videos can provide useful structured priors for control and reinforcement-learning fine-tuning. We formalize this idea through Future-Experience Conditioning (FEC), a simple interface that conditions closed-loop policies on a latent representation of a short future video. In our simulation setup, future clips are generated in three stages, an LLM reasoner operating over a task ontology initialized from the current scene state, a robot-free digital-twin rollout of the intended object motion, and a mask-free video diffusion model that synthesizes a robot-consistent future clip without requiring segmentation at inference. We instantiate this future-conditioning interface primarily with BC and BC+RL, and compare against a future-conditioned Streaming Flow Policy (SFP) baseline on RoboCasa and CALVIN under NoFuture, GTFuture, GenFuture, and WrongFuture. Generated futures improve performance over no-future conditioning, while mismatched futures degrade it, and our BC+RL instantiation achieves the strongest overall results. An average BC+RL learning-curve analysis across 8 CALVIN tasks further shows that GTFuture improves fastest, GenFuture improves earlier and to a higher level than NoFuture, and WrongFuture remains at zero throughout training. These results suggest that short-horizon future videos can serve as useful structured priors for exploration and policy adaptation under imperfect future predictions. https://enact2026.github.io/
Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation ICRA 2026
Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at https://github.com/boyuan-zhangx/Energy-Aware_NECO
comment: 7 pages, 6 figures. Accepted at the ICRA 2026 Workshop on Long-term Deployments in the Wild (LoWi 2026)
Joint Angle Estimation with Customized Wristband Based on Online Incremental Learning
Intelligent wearable technology plays an increasingly important role in human-computer interaction, motion, and health monitoring. To ensure comfort and practicality of use, one common form for motion monitoring is to utilize soft wearable sensors. However, many research applications regarding wearable sensors are simplistic and difficult to adapt to different situations. This study proposes a system for estimating the angle of the wrist joint using a customized wristband based on an online incremental learning approach. It is a two-stage estimation method: the first stage updates the model based on the wearer's wrist movement characteristics using online learning, integrating real-time data from an IMU as ground truth. The second stage utilizes the updated model for estimation of wrist joint angle solely with the wristband. In other words, model training is completed during data acquisition, allowing the trained model to be used for subsequent angle estimation. This method offers advantages in adapting to data drift caused by variations in different testing configurations, such as the left and right wrists of the same subject, deviations in the wearing position on the same wrist, and even differences among various subjects. The results indicate that the sensors exhibit good performance under strain variations, and the wrist joint trajectory estimation of the proposed system has an approximate error of 15 degree in different scenarios.
MARS Policy: Multimodality Only When It Matters
Imitation learning has become a cornerstone for solving complex robotic manipulation tasks. In particular, multimodality, which enables robots to capture diverse yet valid behavioral patterns, has driven the rapid emergence of generative policies as a dominant paradigm in robot learning. However, achieving such multimodality typically relies on stochastic noise initialization and iterative denoising procedures, resulting in substantial training complexity and low inference efficiency. Meanwhile, not all phases of a robotic task inherently require behavioral diversity. Motivated by this insight, we propose the Modality-Adaptive Robot Sampling (MARS) policy, which adaptively invokes tailored stochasticity only when it is truly beneficial, while reverting to an efficient deterministic learning during single-modal phases. In other words, the proper amount of noise is injected only at the proper time. By selectively activating multimodal generation, MARS policy bridges the gap between the multimodal capability of generative policies and the superior training and inference efficiency of deterministic models. Empirical studies across 8 simulated and 4 real-world tasks demonstrate that MARS exhibits robust multimodal expressivity and high efficiency, with a 16.67% success rate improvement and an 83.20% inference latency reduction in real-world tests. Counterintuitively, MARS also outpaces deterministic policies in training efficiency on near-deterministic tasks by more effectively modeling nuanced action diversity.
comment: 13 figures, 17 pages
PhAIL: A Real-Robot VLA Benchmark and Distributional Methodology
Real-world evaluation of vision-language-action (VLA) policies still rests on binary success rate at a fixed timeout with $N \le 25$ rollouts per condition, almost always without confidence intervals or paired statistical comparison; these cohort sizes struggle to resolve close comparisons reliably. We introduce PhAIL (Physical AI Leaderboard, https://phail.ai), an open real-robot benchmark on a Franka FR3 (dataset, per-rollout artifacts, and end-to-end reference implementation) of a distributional evaluation methodology: the time-to-success cumulative distribution function (CDF) as the evaluation primitive, with two separated jobs. The first is scoring via Human-Relative Throughput (HRT), a dimensionless scalar with bootstrap confidence intervals, anchored to same-fixture human teleoperation. The second is a significance test (Kolmogorov-Smirnov, computed per-object and macro-averaged across objects). On four publicly-available VLAs, the macro-averaged KS test resolves two close comparisons (GR00T vs. ACT, OpenPI vs. ACT) at $N \le 30$ rollouts per (model, object) cell where binary-threshold metrics do not; the closest pair (OpenPI vs. GR00T) remains unresolved within our budget. The best evaluated VLA is $\sim 7\times$ slower per operation (RMST ratio) than the human reference.
comment: 22 pages, 10 figures, 8 tables. Dataset, analysis pipeline, and paper source: https://phail.ai and https://github.com/Positronic-Robotics/phail-paper
FLIP: Real-Time and Resilient Formation Planning for Large-Scale DIstributed Swarms via Point Cloud Registration
Traditional large-scale formation planning either oversimplify the formation representation which leads to poor performance, or they employ complete collaborative relationships, which results in excessive computational load. To achieve high-performance and large-scale formation planning, we transform the Optimal Formation Position Sequence \cite{c1} (OFPS) calculation problem into a spatiotemporal Point Cloud Registration (PCR) problem. Each agent derives its OFPS by distributively computing the matching result between current positions and the desired formation positions of all other agents. Then each agent optimizes the cooperative formation trajectory by using OFPS. We leverage the PCR method with outlier rejection to rapidly perform large-scale formation position registration. This prevents suboptimal trajectories and failed agents from propagating through the cooperative network and affecting more agents. Consequently, we uniformly achieve resilient, efficient, and distributed trajectory planning for large-scale swarms. The effectiveness and the superiority of the proposed method are demonstrated through large-scale simulations of 120-drone formation, and rigorous benchmarking against state-of-the-art (SOTA) methods.
Momentum Based Reward Design for Low Emission Traffic Signal Control
Urban traffic congestion is a growing global issue contributing significantly to long commute times and environmental pollution. Traditional traffic signal control systems often fail to adapt to dynamic traffic conditions. Adaptive traffic signal control can improve urban traffic without changing road infrastructure. Deep Reinforcement Learning (DRL) has shown strong performance for this task, but existing delay and queue-based rewards often produce short-sighted or unstable policies. This paper proposes a Momentum-Based Reward Function (MBRF) that encourages vehicles to keep moving rather than penalizing congestion alone. The method is evaluated in SUMO (Simulation of Urban MObility) using standard traffic metrics such as waiting time, queue length, throughput, and CO2 emissions. Results show that the proposed reward produces better throughput-emission trade-offs and more stable learning behavior than delay or queue-based rewards, as well as classical controllers such as Max Pressure and LQF.
EXACT-MPPI: Exact Signed-Distance Navigation for Arbitrary-Footprint Robots from Point Clouds via Path Integral Control
Ground robots often carry payloads, implements, or other attachments that turn their effective footprint into complex, non-convex shapes. Navigating safely through clutter then requires reasoning about this true geometry, yet most local planners simplify it with convex or inflated proxies and rasterize sensor data into occupancy grids or distance fields. Both choices eliminate feasible motions when clearance is comparable to the footprint geometry. We present EXACT-MPPI, a training-free local navigation framework that maps local point-cloud observations and sparse guidance directly to motion commands, without any intermediate map representation. The framework embeds an analytic, exact signed-distance evaluator into a Model Predictive Path Integral (MPPI) controller. The footprint is represented as a simple polygon for general convex or concave planar shapes, with a rectangle-cover specialization for faster evaluation of rectilinear footprints, enabling footprint-aware collision costs without convex decomposition, inflation, or learned encoders. During each MPPI rollout, observed obstacle points are transformed into the predicted body frame and evaluated against the footprint. All operations are batched in JAX, leveraging GPU parallelism for real-time receding-horizon control. Experiments show that EXACT-MPPI accelerates batched distance evaluation over a learned point-to-robot baseline, preserves feasible motion where convex-footprint planners fail, and remains robust under dense static and moving obstacles. The same framework deploys on differential-drive, Ackermann, omnidirectional, and hybrid-mode platforms by changing only the footprint description and motion model without per-platform training. Pairing exact footprint geometry with sampling-based predictive control thus offers a practical, training-free path to footprint-aware local navigation across diverse robots.
VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models
Confidence estimation for Vision-Language-Action (VLA) models is essential for robots to perform manipulation tasks in the open world, providing crucial signals for risk-sensitive decision-making and failure anticipation. Existing confidence estimation methods typically rely on ensemble-based paradigms or action-token probabilities to predict the likelihood of task success. However, they still encounter challenges in computational efficiency and cross-architecture generalizability. These methods usually require repeated sampling, leading to inference inefficiency, and are restricted to VLA models with discrete action outputs, making them difficult to apply to continuous action spaces. To address this issue, we propose VLAConf, a one-class discriminative confidence framework. By leveraging frozen pretrained VLA internal representations, VLAConf directly estimates step-wise anomaly scores in a single forward pass using a lightweight confidence head, thereby eliminating the overhead of exhaustive resampling. We additionally use step-conditioned modeling to encode rollout-phase information along the manipulation trajectory. Experiments on the LIBERO benchmark demonstrate that VLAConf significantly improves the quality of the confidence signal constructed for post-hoc calibration, outperforming existing baselines by a large margin in inference efficiency. The effectiveness of VLAConf is further validated in real-robot experiments. To access the source code and supplementary videos, visit https://sites.google.com/view/vlaconf.
comment: 11 pages, 7 figures
How to Relieve Distribution Shifts in Semantic Segmentation for Off-Road Environments
Semantic segmentation is crucial for autonomous navigation in off-road environments, enabling precise classification of surroundings to identify traversable regions. However, distinctive factors inherent to off-road conditions, such as source-target domain discrepancies and sensor corruption from rough terrain, can result in distribution shifts that alter the data differently from the trained conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST-Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST-Seg offers an intuitive approach for distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style-augmented learning through a deep texture manifold. Experiments across various distribution-shifted target domains demonstrate the effectiveness of ST-Seg, with substantial improvements over existing methods. These results highlight the robustness of ST-Seg, enhancing the real-world applicability of semantic segmentation for off-road navigation.
comment: 8 pages, 6 figures. Accepted to IEEE Robotics and Automation Letters (RA-L). \c{opyright} 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
Learning to Feel Materials from Multisensory Tactile Data via Interpretable Models
Human tactile perception of materials relies on complex multisensory touch cues, yet the relationship between low-level tactile signals and perceptual representations remains poorly understood. This knowledge gap hinders the integration of touch in digital environments and the development of robots capable of human-like tactile perception. Here, we present an interpretable computational framework for modeling human material perception and recognition using multisensory touch data. Our framework comprises three interconnected models: Model 1 maps finger-surface interaction features to psychophysical sensory attributes, Model 2 classifies materials based on these perceptual representations, and Model 3 directly classifies materials from tactile features. The results showed that combining information from pressing, static contact, and sliding interactions improves prediction accuracy, and that thermal cues are particularly informative for both perceptual modeling and material classification. These findings highlight the importance of thermal and compliance cues, which remain underrepresented in current robotic fingers and haptic displays. Incorporating such cues may enhance artificial systems' ability to approximate human material perception and guide the design of more perceptually grounded haptic interfaces.
comment: 12 pages, 3 figures, journal
From General Vision to Reliable Traversability Estimation: Adapting Vision Foundation Models for Unstructured Outdoor Environments
Vision-based approaches have become the dominant paradigm for traversability estimation in unstructured outdoor environments, typically adapting vision foundation models (VFMs) via semantic segmentation supervision. However, this paradigm faces three fundamental challenges that undermine its reliability: the task-agnostic design of VFMs, the ambiguity of traversability annotations, and the discrepancy between semantic labels and physical safety. We propose Vision-to-Traversability Adaptation (ViTA), a framework that adapts VFMs for reliable traversability estimation, instantiated on SAM2. ViTA injects task-specific knowledge through learnable traversability prompts while preserving the VFM's cross-domain generalization. To handle annotation ambiguity, we introduce Perspective-Diversified Training, which estimates semantic uncertainty to suppress confident predictions at ambiguous boundaries. To bridge the semantic-traversability discrepancy, we distill geometric knowledge during training, enabling slope and elevation reasoning from RGB images alone at inference. The semantic and geometric outputs are fused into a continuous traversability score that reflects both semantic uncertainty and geometric risk. Evaluations across diverse domains, including challenging real-world off-road datasets, demonstrate that ViTA achieves state-of-the-art IoU and Precision with substantial false-positive reduction and strong cross-domain generalization.
comment: 8 pages, 5figures
VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world Reinforcement Learning for Robust Contact-Rich Manipulation
When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide task-relevant information that accelerates learning beyond what proprioception alone can achieve. However, vision-enabled policies tend to overfit to the visual conditions seen during training, limiting their robustness and transferability. We present a human-in-the-loop RL framework that employs teacher-student distillation to achieve robust performance across multiple task variants, trained entirely in the real world without requiring domain randomization or data augmentation. A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on pose, twist, and wrench sensing, combining fast training with strong task generalization. On the real-world NIST assembly benchmark board, our approach achieves 95\% overall success after approximately 50 minutes of training on 3 representative tasks, including robust generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task. We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.
Planning with the Views via Scene Self-Exploration
Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1)understanding how a single action transforms the view, and (2)composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space.
VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models
Vision-Language-Action~(VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.
ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models
Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, treating different control steps with largely fixed computation and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on visual-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.
3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding
Vision-Language-Action models have achieved remarkable progress in robotic manipulation, yet they suffer from a critical limitation: a lack of 3D scene understanding. This deficiency manifests as three intertwined challenges: weak extraction of 3D spatial positions without enforcing multi-view consistency, inadequate 3D instance understanding, and fragile reasoning under occlusion. Although mature 3D perception methods exist, their direct integration into VLA pipelines is hindered by architectural incompatibility and by heavy reliance on costly instance-level annotations. To address the above challenges, we propose 3DVLA, a plug-and-play framework that injects robust 3D reasoning into pretrained VLAs without requiring extra manual labels or discarding VLM priors. Specifically, 3DVLA tackles the three challenges through: (1) pervasive 3D feature encoding with explicit multi-view consistency constraints across all modalities and a Spatially-Conditioned Geometry Aggregation method, (2) an instance estimation module with high-level instance tokens for 3D instance awareness, and (3) a masked self-supervised 3D encoding branch that retains its predictor for visual token completion to handle occlusions. We integrate 3DVLA with multiple VLA baselines and evaluate on LIBERO-Plus and RoboTwin 2.0. Results show consistent and significant gains in manipulation performance, validating both the effectiveness and plug-and-play compatibility of our approach.
A Progress-Aware Leader-Follower Midair Docking System for Dual-Drone Aerial Manipulation
Reliable midair docking between small unmanned aerial vehicles (UAVs) is essential for modular aerial cooperation and manipulation, but it requires precise relative-pose control and repeatable platform under tight thrust and payload constraints. We present a dual-drone docking platform where two quadrotors operate in a leader-follower formation and dock using a lightweight modular frame with passive magnetic latching. A progress-aware mission supervisor manages phase transitions: approach, alignment, capture, and settle. This platform integrates a complete hardware-software stack (ROS 2 with Crazyflie/PX4 interfaces) and synchronized logging for benchmark evaluation. We evaluate the platform in simulation and real-world experiments using quantitative metrics such as formation error, baseline and yaw consistency, docking success rate, time-to-dock, and failure-mode statistics. The platform enables statistically grounded comparison of docking supervision and synchronization strategies and provides a practical testbed for modular aerial cooperation and repeatable midair aerial manipulation.
comment: This paper has been accepted for publication in the Proceedings of the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), August 17-21, 2026, Shenyang, China
Decoupled Thrust-Axis Attitude Control Using Quaternions for Chandrayaan-3 Lunar Landing Mission
Chandrayaan-3 mission achieved a historic milestone with its successful soft landing near the lunar south pole, highlighting the critical role of the navigation, guidance, and control (NGC) system. Navigation provided vehicle state estimates relative to the Moon center, while a polynomial based guidance scheme computed the required acceleration profile to meet terminal landing conditions. This acceleration demand was translated into total thrust magnitude and attitude commands generation. Attitude command generation involved aligning the thrust axis with the required acceleration vector and constraining rotation about the thrust axis, typically governed by mission-specific requirements. Although quaternion-based control laws are preferred for their singularity-free representation, they inherently couple all three rotational axes. This coupling can lead to undesirable interactions between guidance and control, especially during large rotations about the thrust axis, due to the quaternion shortest-path property. This paper proposes a novel quaternion-based decoupling method that enables independent thrust-axis control, mitigating guidance-control interaction and ensuring proper attitude commands generation for lander attitude control.
comment: 6 pages, 7 figures, Published in Indian Control Conference 2025
Phase-Conditioned Imitation Learning with Autonomous Failure Recovery for Robust Deformable Object Manipulation
This paper presents a phase-conditioned, force-aware framework for robust deformable object manipulation. Standard imitation learning policies such as Action Chunking with Transformers (ACT) rely on a Markovian assumption at inference, causing state aliasing when visually similar observations require contradictory actions and preventing autonomous recovery from execution failures. We address this with a closed-loop hierarchical architecture. A FiLM-conditioned ACT encoder modulates feature extraction based on the current task phase, enabling a single unified policy to produce phase-specific behaviors while sharing action dynamics across phases. A multi-modal phase predictor fusing visual, force, and pose feedback estimates the phase in real time, detecting contact failures that are invisible to vision alone and autonomously triggering recovery trajectories. The system is completed by a hybrid impedance controller for compliant execution and a haptic teleoperation interface for force-aware data collection. Ablation studies show that FiLM-based modulation significantly outperforms both unconditioned and token-level conditioned baselines, and t-SNE analysis confirms that FiLM induces well-separated, phase-specific feature representations. Validated on hanging and removing a T-shirt with dual arms, the closed-loop system improves the hanging success rate from 56\% to 87\% through autonomous error recovery. Code and videos: https://leledeyuan00.github.io/phaser/
comment: Accepted to IEEE/ASME Transactions on Mechatronics
Decentralized LLM-Driven Coordination of Acoustic Robots for Contactless Object Manipulation
Natural language interfaces can simplify interaction with multi-robot systems, especially when non-expert users need to issue high-level commands. Acoustic manipulation using ultrasonic phased arrays also enables contactless object handling for applications such as healthcare, laboratory automation, and precision transport. However, combining large language models (LLMs) with distributed acoustic mobile robots remains underexplored. This paper presents a decentralized framework for natural language-driven coordination of acoustic robots for contactless object manipulation. The system converts spoken instructions into executable multi-robot task plans using Whisper-based speech recognition, LLM-based semantic parsing, structured JSON task representation, and distributed scheduling. The JSON schema encodes robot assignments, temporal dependencies, spatial constraints, and synchronization requirements for sequential, parallel, and synchronized execution. The system is implemented on two TurtleBot3-based acoustic robots, each equipped with an ultrasonic phased array for contactless object transport. Experiments were conducted in three scenarios: sequential execution, parallel multi-robot transport, and synchronized cooperative manipulation. The system achieved task success rates of 96 percent for sequential tasks, 86 percent for parallel execution, and 70 percent for synchronized collaborative transport. These results show that natural language commands can be transformed into distributed robot actions for contactless manipulation, highlighting the potential of LLM-driven automation for human-robot interaction in distributed robotic systems.
comment: This paper has been accepted for publication in the Proceedings of the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), August 17-21, 2026, Shenyang, China
The Open Motion Planning Library 2.0
The Open Motion Planning Library (OMPL), first released in 2008, has become a cornerstone of the motion planning community, providing implementations of a wide range of state-of-the-art sampling-based algorithms. Over almost two decades of continuous development, we have steadily expanded the library with new planners, state spaces, and problem formulations. These additions range from asymptotically optimal and lazy planners to constrained motion planning and planning with temporal-logic goals. Building on this foundation, we introduce OMPL 2.0, a major evolution of the library that targets real-time motion planning through hardware acceleration and integrates seamlessly with modern AI research workflows. We also reflect on how OMPL and the field of motion planning have grown together over the years, and discuss the library's broader impact on the research community.
MonoDuo: Using One Robot Arm to Learn Bimanual Policies ICRA
Bimanual coordination is essential for many real-world manipulation tasks, yet learning bimanual robot policies is limited by the scarcity of bimanual robots and datasets. Single-arm robots, however, are widely available in research labs. Can we leverage them to train bimanual robot policies? We present MonoDuo, a framework for learning bimanual manipulation policies using single-arm robot demonstrations paired with human collaboration. MonoDuo collects data by teleoperating a single-arm robot to perform one side of a bimanual task while a human performs the other, then swapping roles to cover both sides. RGB-D observations from a wrist-mounted and fixed camera are augmented into synthetic demonstrations for target bimanual robots using state-of-the-art hand pose estimation, image and point cloud segmentation, and inpainting. These synthetic demonstrations, grounded in real robot kinematics, are used to train bimanual policies. We evaluate MonoDuo on five tasks: box lifting, backpack packing, cloth folding, jacket zipping, and plate handover. Compared to approaches relying solely on human bimanual videos, MonoDuo enables zero-shot deployment on unseen bimanual robot configurations, achieving success rates up to 70%. With only 25 target robot demonstrations, few-shot finetuning further boosts success rates by 65-70% over training from scratch, demonstrating MonoDuo's effectiveness in efficiently transferring knowledge from single-arm robot data to bimanual robot policies.
comment: Accepted to appear in the 2026 IEEE International Conference on Robotics and Automation (ICRA), Vienna, Austria, 1-5 June 2026
Extreme dynamic symmetry enables omnidirectional and multifunctional robots
Symmetry is a central organizing principle in natural systems, yet its use as a unifying design strategy in robotics has largely remained limited to geometric form. We show that symmetry can instead be leveraged at the level of dynamic actuation capability. We introduce dynamic symmetry, the uniformity of a robot's attainable center-of-mass accelerations, and formalize it through a measure coined as dynamic isotropy. Across more than 1000 simulated morphologies, we found that higher dynamic symmetry consistently improved trajectory tracking, task success, robustness, resiliency, and energy efficiency, with the benefits becoming most pronounced as dynamic isotropy approached its theoretical limit. To study this regime systematically, we developed Argus, a family of spherical robots designed to explore the effects of increasing dynamic symmetry. Members of the Argus family vary in their actuation geometry and dynamic symmetry level while sharing a common architectural principle: radially oriented linear actuators that directly shape the robot's center-of-mass dynamics. Among them, we built a physical 20-leg Argus variant that achieved near-extreme dynamic isotropy and demonstrated orientation-invariant locomotion, agile traversal of cluttered and deformable terrain, rapid self-stabilization, and resilience to partial actuator failures. Its distributed sensing further enabled omnidirectional perception and object interaction during continuous motion. These results show that designing robots for symmetry not only in morphology but also in their attainable dynamics provides a powerful and general pathway toward agility, robustness, and multifunctionality in uncertain terrestrial and extraterrestrial environments.
comment: Published in Science Robotics (2026). Our project website is at:https://generalroboticslab.com/Argus
Distributed Non-Uniform Scaling Control of Multi-Agent Formation with Dynamic Agent Joining
Non-uniform scaling control of formation enables multi-agent systems to adjust their shape by scaling with different ratios along different coordinate axes, offering enhanced flexibility in complex environments. However, like most existing formation maneuver strategies, it typically assumes a fixed set of agents, limiting its applicability in scenarios requiring dynamic team expansion. This paper introduces a distributed control framework that enables a formation to incorporate new agents during non-uniform scaling maneuvers in arbitrary dimensions while preserving the spectral properties of the graph Laplacian. Simulation examples validate the effectiveness of the theoretical results.
comment: This paper has been accepted by IFAC 2026
BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies
Test-time scaling for vision-language-action (VLA) policies, methods such as RoboMonkey, SEAL, MG-Select, and V-GPS, samples K candidate action chunks at inference and executes the verifier-best. When all K candidates are unsafe, the system executes a violating action with no warning. We propose BOKBO, the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate. We provide both global and per-task (Mondrian) variants, with the per-task variant closing the conditional gap on the hardest tasks. Our analysis exposes a structural failure of policy-internal nonconformity scores under perturbation-based K-sampling: the base-policy confidence proxy and K-sample disagreement correlate at 0.98 with the action-noise hyperparameter $σ$, while correlating at the noise floor with actual safety violations. We test the failure's scope by replicating the analysis under token-level temperature sampling and find the failure is mechanism-specific and partially mitigated under policy-stochasticity-based sampling. A learned violation predictor conditioned on semantic visual features and task identity supports tight calibration: at $ε$ = 0.05 on libero_object_temp_x0.1 with OpenVLA-OFT, the conditional CRC bound holds on 86% of bootstrap splits with 78% coverage and 70% net task success. Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93. Results are stable across 5 training seeds, replicate within bootstrap noise on $π_0$-FAST, hold on libero_spatial_temp_x0.1 as a co-equal benchmark, and survive four within-suite distribution shifts. We additionally identify and correct a methodological pitfall: globally-set force thresholds well below expert-typical manipulation forces conflate unsafe behavior with normal manipulation, inflating violation rates by $5\times$.
Bidirectional Incremental Generalized Hybrid A*
We focus on the problem of efficient anytime kinodynamic planning for systems with complex dynamics in unstructured environments that make precomputing motion primitives infeasible. Directly applying A* to such problems is computationally infeasible due to the curse of dimensionality. Methods such as Hybrid A* addressed this burden by discretizing the state space, but in turn creating a coupling between tree discovery and the discretization resolution. The Incremental Generalized Hybrid A* (IGHA*) performs search over a hierarchy of resolutions in an anytime fashion to break this coupling, by freezing vertices to use in later search iterations rather than pruning them. However, the frozen vertices can hide solution-supporting vertices from the search at a particular iteration. While classical bidirectional search is motivated by the reduction of search depth, extending IGHA* into the bidirectional setting (termed Bi-IGHA*) obtains additional benefit by fundamentally mitigating the behaviour induced by frozen vertices hiding solutions. We show that Bi-IGHA* preserves IGHA*'s guarantees on monotonic cost improvement and termination. We empirically show that Bi-IGHA* substantially reduces expansions on R3, R4, and R6 planning problems, and achieves equivalent closed-loop performance with kinodynamic planning for high-speed off-road autonomy while requiring significantly fewer expansions. Website: https://personalrobotics.github.io/IGHAStar/biighastar.html
PInVerify: An Offline Embodied Benchmark for Active Instance Verification CVPR 2026
Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale ($\leq$8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.
comment: Accepted as a poster at the Foundation Models Meet Embodied Agents (FMEA) Workshop, CVPR 2026. 44 pages including appendix. Code: https://github.com/Avalon-S/PInVerify
Exploiting Chordal Sparsity for Globally Optimal Estimation with Factor Graphs
Robust and efficient state estimation is crucial for perception, navigation, and control in robotics. State estimation problems are conveniently modeled using the factor-graph framework as enabled by modern software packages such as GTSAM or g2o. However, the standard solvers included in such frameworks are local and may converge to poor local minima, posing significant safety concerns. Conversely, techniques based on convex relaxations have been shown to provide a means of globally solving or certifying many state estimation problems. However, these relaxations 1) often require substantial effort to formulate, and 2) may incur significantly higher cost compared to efficient local solvers, as they require solving a large semidefinite program (SDP). In this work, we address both shortcomings by 1) creating a new procedure within the GTSAM framework for automatically constructing convex SDP relaxations for any factor graphs with common factor and variable types, and by 2) exploiting the Bayes tree constructions native to GTSAM to decompose the SDP problem, leading to significant speedup in solver time for chordally sparse problems. We demonstrate the favorable scaling of this structure-exploiting global estimator compared to standard local solvers for two case studies: A 3D pose-graph SLAM problem with a ring factor graph and a 2D localization problem with a chain factor graph. The software framework is available at https://github.com/borglab/gtsam.
ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning
Continuous control policies trained with off-policy reinforcement learning frequently exhibit high-frequency action jitter, rendering direct deployment on physical actuators impractical. Post-hoc filtering attenuates jitter but introduces phase lag; embedding smoothness penalties in the actor's loss couples them with the RL gradient and conflates reward regression with over-aggressive smoothing. We present ZAPS-DA, a framework that reduces action jitter at deployment with negligible phase lag and no post-processing. ZAPS-DA pairs an unmodified main actor (trained by the base RL loss) with a separate decoupled actor trained via supervised imitation of zero-phase filtered targets stored in the replay buffer. The deployed policy is the decoupled actor: a feed-forward map from the current observation to a smooth action, with no inference-time filter and no action-history input -- a mechanism we term causal distillation of a non-causal filter. A magnitude-matched MSE loss provides zero-hyperparameter portability across optimizer classes. Validated with Soft Actor-Critic and a Savitzky--Golay filter in two driving simulators using paired n=150 evaluation protocols: on MetaDrive, ZAPS-DA reduces steering jitter by 14--21x and throttle jitter by 3--5x (all $p < 10^{-4}$, Bonferroni-corrected) while matching task-completion (p=0.28 success, p=0.31 crash) at a 6.3% reward cost; on a custom Webots adaptive cruise control environment, the same SG configuration produces a Pareto improvement -- reward parity (p=0.121), 8--45x steering jitter reduction, and total task-failure rate reduced from 2.0% to 0.7%.
comment: 7 pages, 5 figures, 5 tables. Submitted to IEEE RA-L
Caspar: CUDA Accelerator for Symbolic Programming with Adaptive Reordering ICRA 2026
We present Caspar, a library that makes the power of modern GPUs more accessible in robotics and provides a state-of-the-art nonlinear GPU solver that can be applied to a wide range of different optimization problems. Caspar bridges the gap between expressive symbolic programming in Python and high-performance GPU runtimes in C++ by automatically generating optimized CUDA kernels from symbolic expressions. Building on the SymForce library, users can easily define and combine symbolic expressions, including Lie group operations, to generate custom CUDA kernels. To use Caspar as a solver, users need only define the symbolic residual functions; Caspar then uses symbolic differentiation to generate the necessary GPU kernels and interfaces to perform nonlinear optimization. In this paper, we present the core components of Caspar and showcase its performance by performing bundle adjustment on the Bundle Adjustment in the Large (BAL) dataset. We benchmark Caspar against other state-of-the-art bundle adjusters and show that it is 5 to 20 times faster than the best alternative, requires less memory, and achieves similar accuracy. This illustrates the benefit of our symbolic GPU programming approach. Caspar is released as part of SymForce and is freely available at https://github.com/symforce-org/symforce
comment: Accepted at ICRA 2026
Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes
Industrial visual sim-to-real is often described as transferring from synthetic images to real images, but industrial deployment usually involves a broader mismatch between available evidence and required decisions. A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, yet deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. We distinguish CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the CAD role. This framing connects CAD-based detection and 6D pose-estimation literature with industrial anomaly and surface-inspection literature that is usually reviewed separately. To make the taxonomy concrete, we use empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that CAD render count alone does not close transfer; source-distribution design, detector capacity, and small real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. The review therefore argues against a single cross-task leaderboard and instead asks what prior grounds the deployment decision.
comment: Review article; 103 references; 9 main figures; empirical anchors on T-LESS/BOP, MVTec AD, and VisA
Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode
Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity
Robotic manipulation dexterity is often pursued by building increasingly complex high-DoF multifingered hands. While many robotic hands are designed to replicate human morphology, the functional role of human hands suggests a different perspective: much of their complexity may exist to enable tool use and tool making. This observation motivates Any-ttach, a tool-centric manipulation framework that treats quick end-effector swapping as a mechanism for dexterity with simplicity. Any-ttach combines a low-cost automatic swapping mechanism for an open-close robot interface, a handheld device for collecting human demonstrations, and a task planning framework that composes learned, parameterized, and planned tool-use skills. The system supports diverse tools and end-effector modules, including daily tools, articulated tools such as scissors, Fin Ray fingers, and a low-cost anthropomorphic hand, through the same shared interface. Our experiments show that Any-ttach improves tool-swapping reliability, increases demonstration efficiency, reduces tool-pose variability, and supports diverse tool-use skills. In two long-horizon tasks, making a sandwich and preparing a cucumber, Any-ttach executes six tool-use subskills through end-effector switching and execution monitoring. These results suggest that robots can expand manipulation capability not only through more complex end-effectors, but also through rapidly exchangeable tools and end-effector modules. More details and videos are available at https://any-ttach.github.io/.
ARISTO Hand: Sensing-Driven Distal Hyperextension for Fine-Grained Manipulation
Manipulating thin objects requires precise contact geometry and reliable force perception, yet many anthropomorphic robotic hands lack the mechanical and sensing capabilities needed for such interactions. We present the ARISTO Hand, a tendon-driven robotic hand that integrates active distal hyperextension with a hybrid fingertip-sensing architecture that combines a rigid, nail-mounted force-torque sensor and a soft capacitive tactile array. Active hyperextension enables controlled fingertip engagement beyond the kinematic limits of standard flexion, increasing pull-out force by 2.76x for object thicknesses of 1-20 mm while preserving the nominal grasp capability. The rigid nail-mounted sensor provides reliable force measurements during edge contacts, where the sensitivity of proprioceptive force estimation degrades as the contact geometry approaches kinematic singularities. We validate the proposed architecture through quantitative force characterization and a multi-stage SD card extraction and insertion task. Video and supplementary materials are available at: https://aristohand.github.io
VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments
Global localization in geometrically aliased, quasi-static environments such as grocery stores, offices, schools, and hospitals poses a significant challenge for mobile robots. Grocery stores with parallel aisles and a long tailed distribution of products, as well as offices and labs with repetitive furniture such as chairs, desks, monitors, and doors, exemplify common indoor environments that present geometric and even semantic ambiguity. Traditional approaches rely either on distinct geometric features or on domain-specific vision pipelines that struggle with long-tail semantic distributions and transient visual clutter. We present VLM-GLoc, a method for hierarchical semantic Monte Carlo Localization (MCL) that leverages open-vocabulary Vision-Language Models (VLMs) as a unified semantic observation front-end. We hypothesize a three-fold benefit from VLMs: (1) extracting highly discriminative rich text features, (2) implicit quality filtering of blurry or dynamic objects, and (3) permanence reasoning for targeted data augmentation. We introduce an inverse semantic proposal mechanism that seeds particles via text-to-map retrieval. Evaluated across two real-world environments with different characteristics and two different platforms: a 3,500 sq. ft. grocery store with a cellphone and a 3,700 sq. ft. lab space with a quadruped, VLM-GLoc achieves 70% and 74% global localization success respectively, substantially outperforming traditional geometry-only and domain-specific baselines.
Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics
Learning to reach arbitrary goals from sparse feedback requires agents to infer a rich notion of reachability across state--goal pairs. Goal-conditioned reinforcement learning (GCRL) tackles this challenge by learning policies that generalize across goals, but this generalization becomes increasingly difficult as the underlying dynamics become high-dimensional, hybrid, or contact-dependent. To address this issue, physics-informed GCRL (Pi-GCRL) introduces optimal-control-inspired inductive biases into goal-conditioned value learning. While Pi-GCRL methods have proven effective in navigation and object-free goal-reaching domains, their reliability in contact-rich tasks remains unclear, where contact interactions induce hybrid dynamics, mode-dependent controllability, and nonsmooth value landscapes. In this work, we show that these structural properties can cause existing Pi-GCRL methods to degrade when applied naively to contact-rich manipulation. Motivated by this analysis, we introduce contact-aware and hierarchical formulations that apply physics-informed inductive biases selectively across the manipulation problem. Our results provide a principled step toward extending Pi-GCRL to contact-rich manipulation.
CoMo3R-SLAM: Collaborative Monocular Dense SLAM with Learned 3D Reconstruction Priors for Outdoor Multi-Agent Systems
Collaborative dense SLAM is essential for multi-robot teams to achieve scalable and consistent 3D perception across large-scale outdoor environments. Existing systems typically depend on depth sensors, incurring significant payload, power, and calibration costs. Monocular RGB cameras are a lightweight alternative, but collaborative monocular dense SLAM remains difficult due to scale ambiguity, unreliable inter-agent data association, especially in outdoor scenes where low overlap and repetitive structures make traditional feature matching unreliable, motivating robust geometric information. We propose CoMo3R-SLAM, the first collaborative monocular dense RGB SLAM system that leverages robust learned feed-forward 3D reconstruction priors for outdoor multi-agent mapping. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion, while a coordinator performs dense pointmap matching for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment with segment-level depth optimization. Requiring neither depth sensors nor parametric intrinsics, our system produces robust cross-agent constraints and globally consistent metric maps from monocular RGB alone. On Tanks and Temples and Waymo sequences, CoMo3R-SLAM achieves the best ATE on three of four Tanks and Temples scenes and competitive Waymo accuracy, matching or exceeding state-of-the-art RGB-D methods while running online at 8 FPS.
ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation
Vision-Language-Action (VLA) models have shown promise for robotic manipulation, yet most existing policies operate reactively by directly regressing actions from current observations, without explicitly modeling future dynamics. This limits their ability to generalize under out-of-distribution perturbations. To address this issue, we propose ELAN4D, an embodiment-centric, 4D-aware training framework that enhances VLA policies with future robot keypoint tracks as predictive spatio-temporal supervision. Using only forward kinematics from proprioceptive states, we derive 3D displacement tracks of robot keypoints, such as joints and the end-effector, with negligible preprocess cost. These tracks provide metric and compact supervision without requiring external trackers or reconstruction. A plug-and-play auxiliary branch with a lightweight track decoder injects this 4D signal into the action expert while preserving the pretrained vision-language backbone through gradient isolation. The track decoder is discarded during inference, leaving the base policy interface unchanged. Extensive experiments on LIBERO, LIBERO-Plus, RoboTwin2.0 and real-world manipulation tasks demonstrate that ELAN4D consistently improves over strong VLA baselines, achieving the best overall performance and substantial gains under out-of-distribution perturbations, including camera, background, and layout shifts. These results highlight the effectiveness of embodiment-centric 4D supervision for building more robust and generalizable manipulation policies.
Learning-Based Navigation for Indoor Mobile Robots
This paper presents a learning-based navigation framework for indoor mobile robots. The proposed method combines a supervised neural global planner, trained from cost-aware A* expert trajectories, with the proposed Learning-Based DWA local planner, which is formulated as discrete candidate selection over the Dynamic Window Approach (DWA) action lattice. For local planning, the policy is first trained by behavior cloning and then refined by Proximal Policy Optimization (PPO) under feasibility-aware masking. The framework is implemented and evaluated in both simulated and real-world indoor environments. Experimental results show that the proposed method generates feasible global routes and reliable local motion commands for safe goal-directed navigation in the presence of obstacles. These results demonstrate the effectiveness of integrating learning-based global planning with reinforcement-learning-refined local control for indoor mobile robot navigation. The source code will be released at https://ntdathp.github.io/rl_robot_web/.
Structured interactions improve distributed coordination beyond model scaling in a real-world multi-robot system
Scaling individual robot capabilities is common but costly. Here we investigate a system-level design question in real-world multi-robot coordination: given matched hardware budgets, does restructuring communication among robots yield larger gains than increasing onboard model size? Using a representative transport-and-mapping task with 10 physical robots (5 runs per condition, 60 runs total), we find that switching from fully connected to modular hierarchical interactions improves normalised performance by 47 points (0--100), whereas doubling neural network hidden size yields at most 9 points. Nested mixed-effects model comparisons show a substantially larger improvement in model fit for topology than for scale. The pattern is confirmed in independent SMAC replications; heterogeneous benchmark reanalyses provide secondary supporting consistency checks rather than primary evidence. Performance saturation beyond 1024 hidden units is observed in simulation-calibrated extrapolation, not directly on hardware. These results indicate that interaction structure can play a dominant role within the tested system and task setting, while broader quantitative generalisation remains to be established.
Follow Everything: A Leader-Following and Obstacle Avoidance Framework with Goal-Aware Adaptation
Robust and flexible leader-following is a critical capability for robots to integrate into human society. While existing methods struggle to generalize to leaders of arbitrary form and often fail when the leader temporarily leaves the robot's field of view, this work introduces a unified framework addressing both challenges. First, traditional detection models are replaced with a segmentation model, allowing the leader to be anything. To enhance recognition robustness, a distance frame buffer is implemented that stores leader embeddings at multiple distances, accounting for the unique characteristics of leader-following tasks. Second, a goal-aware adaptation mechanism is designed to govern robot planning states based on the leader's visibility and motion, complemented by a graph-based planner that generates candidate trajectories for each state, ensuring efficient following with obstacle avoidance. Simulations and real-world experiments with a legged robot follower and various leaders (human, ground robot, UAV, legged robot, stop sign) in both indoor and outdoor environments show competitive improvements in follow success rate, reduced visual loss duration, lower collision rate, and decreased leader-follower distance.
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.
ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling
Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans, where only one arm is moving at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that's a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.
comment: Project website: https://schedulestream.github.io
Accelerating trajectory optimization with Sobolev-trained diffusion policies
Trajectory Optimization (TO) solvers exploit known system dynamics to compute locally optimal trajectories through iterative improvements. A downside is that each new problem instance is solved independently; therefore, convergence speed and quality of the solution found depend on the initial trajectory proposed. To improve efficiency, a natural approach is to warm-start TO with initial guesses produced by a learned policy trained on trajectories previously generated by the solver. Diffusion-based policies have recently emerged as expressive imitation learning models, making them promising candidates for this role. Yet, a counterintuitive challenge comes from the local optimality of TO demonstrations: when a policy is rolled out, small non-optimal deviations may push it into situations not represented in the training data, triggering compounding errors over long horizons. In this work, we focus on learning-based warm-starting for gradient-based TO solvers that also provide feedback gains. Exploiting this specificity, we derive a first-order loss for Sobolev learning of diffusion-based policies using both trajectories and feedback gains. Through comprehensive experiments, we demonstrate that the resulting policy avoids compounding errors, and so can learn from very few trajectories to provide initial guesses reducing solving time by $2\times$ to $20 \times$. Incorporating first-order information enables predictions with fewer diffusion steps, reducing inference latency.
TRUST-Planner: Topology-guided Robust Trajectory Planner for AAVs with Uncertain Obstacle Spatial-temporal Avoidance
Despite extensive developments in motion planning of autonomous aerial vehicles (AAVs), existing frameworks faces the challenges of local minima and deadlock in complex dynamic environments, leading to increased collision risks. To address these challenges, we present TRUST-Planner, a topology-guided hierarchical planning framework for robust spatial-temporal obstacle avoidance. In the frontend, a dynamic enhanced visible probabilistic roadmap (DEV-PRM) is proposed to rapidly explore topological paths for global guidance. The backend utilizes a uniform terminal-free minimum control polynomial (UTF-MINCO) and dynamic distance field (DDF) to enable efficient predictive obstacle avoidance and fast parallel computation. Furthermore, an incremental multi-branch trajectory management framework is introduced to enable spatio-temporal topological decision-making, while efficiently leveraging historical information to reduce replanning time. Simulation results show that TRUST-Planner outperforms baseline competitors, achieving a 96\% success rate and millisecond-level computation efficiency in tested complex environments. Real-world experiments further validate the feasibility and practicality of the proposed method.
comment: Accepted by IEEE Transactions on Industrial Electronics (TIE) for publication. The final version will be available online at https://ieeexplore.ieee.org/ after publication
Safety-Critical Adaptive Impedance Control via Nonsmooth Control Barrier Functions under State and Input Constraints
Safe physical interaction is critical for deploying robotic manipulators in human-robot interaction and contact-rich tasks, where uncertainty, external forces, and actuator limitations can compromise both performance and safety. We propose an online adaptive impedance control framework that enforces joint-state safety while achieving compliant interaction under uncertain dynamics. The approach combines a quadratic-program-based safety filter with a novel composed position-velocity non-smooth control barrier function (NCBF), enabling joint position and velocity constraints to be enforced through a unified relative-degree-one barrier. Unknown dynamics are compensated online using an interval type-2 fuzzy logic system, while actuator torque limits are handled through soft constraints with exact penalty recovery of feasible solutions. A disturbance-observer-enhanced safety mechanism improves robustness against modelling errors and external interaction forces. Using composite Lyapunov analysis, we prove forward invariance of the safe set and the uniform ultimately boundedness of the impedance-tracking error. Simulations on a 7-DOF manipulator with severe parametric uncertainty and external interaction wrenches demonstrate safe constraint satisfaction and robust impedance tracking.
comment: 12 pages, 3 figures
SM2ITH: Safe Mobile Manipulation with Interactive Human Prediction via Task-Hierarchical Bilevel Model Predictive Control ICRA
Mobile manipulators are designed to perform complex sequences of navigation and manipulation tasks in human-centered environments. While recent optimization-based methods such as Hierarchical Task Model Predictive Control (HTMPC) enable efficient multitask execution with strict task priorities, they have so far been applied mainly to static or structured scenarios. Extending these approaches to dynamic human-centered environments requires predictive models that capture how humans react to the actions of the robot. This work introduces Safe Mobile Manipulation with Interactive Human Prediction via Task-Hierarchical Bilevel Model Predictive Control (SM$^2$ITH), a unified framework that combines HTMPC with interactive human motion prediction through bilevel optimization that jointly accounts for robot and human dynamics. The framework is validated on two different mobile manipulators, the Stretch 3 and the Ridgeback-UR10, across three experimental settings: (i) delivery tasks with different navigation and manipulation priorities, (ii) sequential pick-and-place tasks with different human motion prediction models, and (iii) interactions involving adversarial human behavior. Our results highlight how interactive prediction enables safe and efficient coordination, outperforming baselines that rely on weighted objectives or open-loop human models.
comment: Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026
Trust, Geometry, and Rules: A Credibility-Aware Reinforcement Learning Framework for Safe USV Navigation under Uncertainty
Autonomous navigation of Unmanned Surface Vehicles (USVs) that is safe and compliant with the International Regulations for Preventing Collisions at Sea (COLREGs) remains a formidable challenge in dynamic maritime environments, particularly when perception systems exhibit miscalibrated uncertainty. Existing Reinforcement Learning (RL)-based methods often falter because state-estimation errors induce unreliable belief states that mislead the value function, while discrete traffic rules introduce discontinuity in the learning objective. To address these challenges, we propose a framework integrating credibility-aware learning, geometric safety shielding, and continuous rule-aware embedding. First, Credibility-Weighted Value Learning (CW-VL) introduces a dynamic trust factor derived from the discrepancy between filter-estimated covariance and empirical error statistics to modulate the critic's heteroscedastic loss, preventing policy overfitting to noisy samples. Second, the Covariance-Inflated Velocity Obstacle (CI-VO) maps position-estimation uncertainty into set-wise angular margins, forming a conservative geometric shield that overrides hazardous exploratory actions. Third, Risk-Aware COLREGs Duty Embedding relaxes binary encounter duties into continuous rule-aware signals, providing smooth sector-transition information and suppressing oscillation from sparse rule rewards. Simulated encounter studies demonstrate improved training robustness against perceptual inconsistency and superior collision avoidance and COLREGs compliance over baselines.
Quasi-Static Control of Discrete Cosserat Rod
In this paper, we design feedback control laws for soft robots modelled using the Cosserat rod, which is spatially discretised using the Piecewise Constant Strain (PCS) approach. The PCS approach transforms the nonlinear PDEs describing the Cosserat rod to a system of nonlinear ODEs. This simplification results in a model describing soft robots which is similar to the serial rigid-link manipulators. We design feedback control laws for the quasi-static PCS model by using the external wrenches as control input. The control laws are designed based on state-feedback linearisation in strain and task spaces. An extensive set of numerical results demonstrates the performance of the control laws for end-effector trajectory tracking and shape control of soft robots.
comment: Submitted to 17th APCA International Conference on Automatic Control and Soft Computing (CONTROLO 2026)
Learning A Simulation-based Visual Policy for Real-world Peg In Unseen Holes
This paper proposes a learning-based visual peg-in-hole that enables training with several shapes in simulation, and adapting to arbitrary unseen shapes in real world with minimal sim-to-real cost. The core idea is to decouple the generalization of the sensory-motor policy to the design of a fast-adaptable perception module and a simulated generic policy module. The framework consists of a segmentation network (SN), a virtual sensor network (VSN), and a controller network (CN). Concretely, the VSN is trained to measure the pose of the unseen shape from a segmented image. After that, given the shape-agnostic pose measurement, the CN is trained to achieve generic peg-in-hole. Finally, when applying to real unseen holes, we only have to fine-tune the SN required by the simulated VSN+CN. To further minimize the transfer cost, we propose to automatically collect and annotate the data for the SN after one-minute human teaching. Simulated and real-world results are presented under the configurations of eye-to/in-hand. An electric vehicle charging system with the proposed policy inside achieves a 10/10 success rate in 2-3s, using only hundreds of auto-labeled samples for the SN transfer.
GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation
Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors from pretrained vision-language models to action generation. However, standard action-imitation learning often lacks sufficient modeling of explicit 3D spatial information, dense geometric supervision, and future environment evolution, all critical for precise robotic interaction. To address this, we propose \textbf{GaussianDream}, a feed-forward 3D Gaussian world-model plug-in. Specifically, we introduce learnable GaussianDream Queries in the encoder, enabling the model to capture current-frame 3D spatial structure and short-horizon future evolution. During training, the latent GaussianDream prefix is processed by a static reconstruction head and a future prediction head to produce current 3D Gaussian scene states and future Gaussian evolution states. The current branch is supervised by RGB rendering and depth, while the future branch uses future RGB, depth, and pseudo 3D scene-flow signals. During inference, GaussianDream discards all auxiliary heads and retains only the learned prefix to condition action generation, without test-time Gaussian reconstruction or future prediction. Experimental results demonstrate that GaussianDream achieves state-of-the-art performance across multiple robotic manipulation benchmarks, reaching \textbf{98.4\%} on LIBERO, \textbf{54.8\%} on RoboCasa Human-50, and \textbf{50.0\%} on real-robot tasks. Compared with existing 3D-enhanced VLA methods, GaussianDream achieves strong accuracy while providing higher inference efficiency than video-based world-model approaches.
comment: 19 pages, 9 figures
Muscle Synergy Priors Enhance Biomechanical Fidelity in Predictive Musculoskeletal Locomotion Simulation
Human locomotion emerges from high-dimensional neuromuscular control, making predictive musculoskeletal simulation challenging. We present a physiology-informed reinforcement-learning framework that constrains control using muscle synergies. We extracted a low-dimensional synergy basis from inverse musculoskeletal analyses of a small set of overground walking trials and used it as the action space for a muscle-driven three-dimensional model trained across variable speeds, slopes and uneven terrain. The resulting controller generated stable gait from 0.7-1.8 m/s and on $\pm$ 6$^{\circ}$ grades and reproduced condition-dependent modulation of joint angles, joint moments and ground reaction forces. Compared with an unconstrained controller, synergy-constrained control reduced non-physiological knee kinematics and kept knee moment profiles within the experimental envelope. Across conditions, simulated vertical ground reaction forces correlated strongly with human measurements, and muscle-activation timing largely fell within inter-subject variability. These results show that embedding neurophysiological structure into reinforcement learning can improve biomechanical fidelity and generalization in predictive human locomotion simulation with limited experimental data.
comment: Added a manuscript footnote stating "Project page with supplementary videos: https://ces40320.github.io/WebHomepage__Walk-RL ."
Simulation-based planning of Motion Sequences for Automated Procedure Optimization in Multi-Robot Assembly Cells
Reconfigurable multi-robot cells offer a promising approach to meet fluctuating assembly demands. However, the recurrent planning of their configurations introduces new challenges, particularly in generating optimized, coordinated multi-robot motion sequences that minimize the assembly duration. This work presents a simulation-based method for generating such optimized sequences. The approach separates assembly steps into task-related core operations and connecting traverse operations. While core operations are constrained and predetermined, traverse operations offer substantial optimization potential. Scheduling the core operations is formulated as an optimization problem, requiring feasible traverse operations to be integrated using a decomposition-based motion planning strategy. Several solution techniques are explored, including a sampling heuristic, tree-based search and gradient-free optimization. For motion planning, a decomposition method is proposed that identifies specific areas in the schedule, which can be solved independently with modified centralized path planning algorithms. The proposed method generates efficient and collision-free multi-robot assembly procedures that outperform a baseline relying on decentralized, robot-individual motion planning. Its effectiveness is demonstrated through simulation experiments.
comment: Accepted for publication at IEEE CASE 2026
A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach
Motion planning for autonomous driving (AD) faces a critical trade-off. While traditional rule-based pipelines offer verifiable safety and interpretability, they often fail to generalize in complex scenarios. Conversely, emerging learning-based methods-including imitation learning (IL), reinforcement learning (RL), and generative AI-offer greater adaptability but are often constrained by opacity and safety risks. Existing surveys typically analyze these AI methods in isolation, overlooking the potential of integrating them with rigorous control frameworks. To bridge this gap, this paper presents the first systematic review of the Data-Driven Optimal Control (DDOC) paradigm, explicitly examining how it synergizes the theoretical guarantees of optimal control with the adaptive capabilities of modern machine learning. Building on this framework, we propose the first roadmap for DDOC-based motion planning, structuring its implementation into three critical dimensions: customization, dynamics adaptation, and self-tuning. Finally, to close the remaining reality gap, we identify four future research directions, thereby accelerating the transition to trustworthy and human-like autonomous driving.
comment: 44 pages, 14 figures
HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos
Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments. We release HumanEgo as an easy-to-use, open-source framework for learning robot policies directly from human data: https://github.com/TX-Leo/HumanEgo
comment: Project page: https://humanego-ai.github.io
Multifingered force-aware control for humanoid robots ICRA 2026
In this paper, we address force-aware control and force distribution in robotic platforms with multi-fingered hands. Given a target goal and force estimates from tactile sensors, we design a controller that adapts the motion of the torso, arm, wrist, and fingers, redistributing forces to maintain stable contact with objects of varying mass distribution or unstable contacts. To estimate forces, we collect a dataset of tactile signals and ground-truth force measurements using five Xela magnetic sensors interacting with indenters, and train force estimators. We then introduce a model-based control scheme that minimizes the distance between the Center of Pressure (CoP) and the centroid of the fingertips contact polygon. Since our method relies on estimated forces rather than raw tactile signals, it has the potential to be applied to any sensor capable of force estimation. We validate our framework on a balancing task with five objects, achieving a $82.7\%$ success rate, and further evaluate it in multi-object scenarios, achieving $80\%$ accuracy. Code and data can be found here https://github.com/hsp-iit/multifingered-force-aware-control.
comment: This work has been accepted for publication in ICRA 2026
Practical Insights on Grasp Strategies for Mobile Manipulation in the Wild IROS 2025
Mobile manipulation robots are continuously advancing, with their grasping capabilities rapidly progressing. However, there are still significant gaps preventing state-of-the-art mobile manipulators from widespread real-world deployments, including their ability to reliably grasp items in unstructured environments. To help bridge this gap, we developed SHOPPER, a mobile manipulation robot platform designed to push the boundaries of reliable and generalizable grasp strategies. We develop these grasp strategies and deploy them in a real-world grocery store -- an exceptionally challenging setting chosen for its vast diversity of manipulable items, fixtures, and layouts. In this work, we present our detailed approach to designing general grasp strategies towards picking any item in a real grocery store. Additionally, we provide an in-depth analysis of our latest real-world field test, discussing key findings related to fundamental failure modes over hundreds of distinct pick attempts. Through our detailed analysis, we aim to offer valuable practical insights and identify key grasping challenges, which can guide the robotics community towards pressing open problems in the field.
comment: 8 pages, 8 figures, submitted to IROS 2025
SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting
LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capturing is susceptible to miss small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR capturings and attribute LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, which we can then use to grow additional points from to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.
comment: Project page: https://lfranke.github.io/surffill
Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning
A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively builds a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, enabling flexible and efficient lifelong forward transfer. Furthermore, by leveraging the modular structure of the fine-tuned parameters, we introduce expert coefficient replay, which guides the router to accurately retrieve frozen experts for previously encountered tasks. This technique mitigates forgetting while being significantly more storage- and computation-efficient than experience replay over the entire policy. Extensive experiments on the lifelong robot learning benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates during continual adaptation, while utilizing minimal trainable parameters and storage.
comment: Accepted to Transactions on Machine Learning Research (TMLR) at https://openreview.net/forum?id=MHVBrjS8cG . Code is available at https://github.com/HarryLui98/DMPEL
Environment-Adaptive Solid-State LiDAR-Inertial Odometry
Solid-state LiDAR-inertial SLAM has attracted significant attention due to its advantages in speed and robustness. However, achieving accurate mapping in extreme environments remains challenging due to severe geometric degeneracy and unreliable observations, which often lead to ill-conditioned optimization and map inconsistencies. To address these challenges, we propose an environment-adaptive solid-state LiDAR-inertial odometry that integrates local normal-vector constraints with degeneracy-aware map maintenance to enhance localization accuracy. Specifically, we introduce local normal-vector constraints to improve the stability of state estimation, effectively suppressing localization drift in degenerate scenarios. Furthermore, we design a degeneration-guided map update strategy to improve map precision. Benefiting from the refined map representation, localization accuracy is further enhanced in subsequent estimation. Experimental results demonstrate that the proposed method achieves superior mapping accuracy and robustness in extreme and perceptually degraded environments, with an average RMSE reduction of up to 12.8% compared to the baseline method.
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model ICML 2026
Augmenting vision-language-action models (VLAs) with world models is promising for robotic policy learning but faces challenges in jointly predicting states and actions due to the modality gap. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework featuring a multimodal diffusion transformer that maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, DUST utilizes independent noise perturbations and a decoupled flow matching loss to learn cross-modal causal relationships. We further introduce an asynchronous sampling method for action and vision tokens that enhances performance through inference-time scaling. Experimental results on simulated benchmarks like RoboCasa and GR-1 show that DUST achieves up to 6% gains over state-of-the-art VLA and world-modeling baselines, with inference-time scaling providing an additional 2-5% improvement. In real-world tasks using the Franka Research 3, DUST outperforms baselines by 10% in success rate. Finally, we demonstrate that DUST enables effective transfer learning through both pretraining on action-free videos and joint-training with heterogeneous robot and human datasets.
comment: Accepted at ICML 2026. Project page at https://periphanes.github.io/dust (20 pages, 10 figures)
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities and generalization in embodied manipulation. However, their decision-making relies on a fast, instinctive process that lacks deliberation. This strategy often leads to suboptimal or catastrophic actions when facing complex or ambiguous scenarios that require greater consideration. In this paper, we introduce \textbf{VLA-ATTC}, a framework that endows VLA models with adaptive test-time compute (TTC). VLA-ATTC employs an uncertainty-based ``cognitive clutch'' to dynamically transition from reflexive execution to a TTC deliberation phase when necessary. During TTC phase, a novel \textbf{Relative Action Critic} (RAC) model identifies the optimal action from generated candidates via pairwise comparisons. This relative mechanism replaces unstable absolute value estimation, significantly simplifying the learning objective. Furthermore, we introduce an efficient sampling strategy to amortize computational costs and an automated data pipeline that curates preference pairs without manual annotation. On the LIBERO-LONG benchmark, VLA-ATTC reduces the failure rate of the SOTA model PI0.5 by over 50\%. We will open-source all the code and weights.
Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
Vision-language-action (VLA) models have advanced the field of embodied manipulation by harnessing broad world knowledge and strong generalization. However, current VLA models still face several key challenges, including limited reasoning capability, lack of status monitoring, and difficulty in self-correction. In this paper, we introduce \textbf{Sentinel-VLA}, a metacognitive VLA model equipped with an active ``sentinel'' module to monitor real-time execution status. Only when necessary, such as during initial planning or upon detecting an error, the model triggers a dynamic reasoning or formulate error recovery solutions. This on-demand reasoning mechanism ensures robust decision-making while minimizing computational overhead. Notably, all training data (spanning 44 tasks and over 2.6 million transitions) is automatically generated and annotated through our designed pipeline. We also propose the Self-Evolving Continual Learning (SECL) algorithm, which allows Sentinel-VLA to identify its capability boundaries and automatically collect data for expansion, paired with Orthogonal Continual Adapter (OC-Adapter) to constrain parameter updates to an orthogonal space, thereby preventing catastrophic forgetting. Real-world experiments demonstrate that Sentinel-VLA boosts the task success rate by over 30\% compared to the SOTA model, PI0. We will open-source all the code, weights, and data generation pipeline.
Contrastive Representation Regularization for Vision-Language-Action Models ICML 2026
Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address the issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the performance of state-of-the-art VLA models; it pushes the prior art to 69.7% achieving the state-of-the-art performance on the RoboCasa-Kitchen benchmark, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
comment: ICML 2026
CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation
We present CoRMA(Contrastive Robotic Motor Adaptation), a context-based meta-adaptation framework that modifies RMA for force-dominant assembly. CoRMA replaces raw simulator-parameter adaptation with a compact 6D simulator-only semantic contact context describing contact onset, lateral engagement, guided transition, contact direction, and jamming. A deployable causal Transformer adapter infers this context online from force, proprioceptive, and action histories using semantic regression and a force-regime contrastive objective. At deployment, oracle context is removed and replaced by the inferred context, enabling within-episode adaptation without demonstrations, privileged inputs, or gradient updates. We evaluate CoRMA on PegInsert, GearMesh, and NutThread in Isaac Lab / Isaac Sim 5.0 and on a real Marvin arm. Compared with FORGE baselines that achieve high simulation success but degrade substantially on hardware, CoRMA retains higher verified real success under controlled target-pose noise. These results support semantic contact inference as a reusable adaptation interface within a related assembly task family, while broader unseen-task generalization and Real2Sim calibration remain future work.
TACO: Temporal Consensus Optimization for Continual Neural Mapping
Neural implicit mapping has emerged as a powerful paradigm for robotic navigation and scene understanding. However, real-world robotic deployment requires continual adaptation to changing environments under strict memory and computation constraints, which existing mapping systems fail to support. Most prior methods rely on replaying historical observations to preserve consistency and assume static scenes. As a result, they cannot adapt to continual learning in dynamic robotic settings. To address these challenges, we propose TACO (TemporAl Consensus Optimization), a replay-free framework for continual neural mapping. We reformulate mapping as a temporal consensus optimization problem, where we treat past model snapshots as temporal neighbors. Intuitively, our approach resembles a model consulting its own past knowledge. We update the current map by enforcing weighted consensus with historical representations. Our method allows reliable past geometry to constrain optimization while permitting unreliable or outdated regions to be revised in response to new observations. TACO achieves a balance between memory efficiency and adaptability without storing or replaying previous data. Through extensive simulated and real-world experiments, we show that TACO robustly adapts to scene changes, and consistently outperforms other continual learning baselines. Code is available at https://iconlab.negarmehr.com/TACO
comment: In: Robotics: Science and Systems (RSS 2026)
Scensory: Real-Time Robotic Olfactory Perception for Joint Identification and Source Localization
While robotic perception has advanced rapidly in vision and touch, enabling robots to reason about indoor fungal contamination from weak, diffusion-dominated chemical signals remains an open challenge. We introduce Scensory, a learning-based robotic olfaction framework that simultaneously identifies fungal species and localizes their source from short time series measured by affordable, cross-sensitive VOC sensor arrays. Temporal VOC dynamics encode both chemical and spatial signatures, which we decode through neural networks trained on robot-automated data collection with spatial supervision. Across five fungal species, Scensory achieves up to 89.85% species accuracy and 87.31% source localization accuracy under ambient conditions with 3-7s sensor inputs. These results demonstrate real-time, spatially grounded perception from diffusion-dominated chemical signals, enabling scalable and low-cost source localization for robotic indoor environmental monitoring.
comment: Our project website is at: http://generalroboticslab.com/Scensory
Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning ICML 2026
We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that uses a single flow policy iteration and requires a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at https://github.com/brianlsy98/FAN.
comment: ICML 2026
Force Sensing for Wearable Human-Robot Interfaces via Fluidic Innervation
Mechanically characterizing the human-machine interface is essential to understanding user behavior and optimizing wearable robot performance. This interface has been challenging to sensorize due to manufacturing complexity and non-linear sensor responses. Here, we measure human limb-device interaction via fluidic innervation, creating a 3D-printed silicone pad with embedded air channels to measure forces. As forces are applied to the pad, the air channels compress, resulting in a pressure change measurable by off-the-shelf pressure transducers. We demonstrate in benchtop testing that pad pressure is highly linearly related to applied force ($R^2 = 0.998$) and confirmed strong linear relationships to isometric knee torque in a clinical dynamometer with strategic pad placement. We built on these idealized settings to test pad performance in more unconstrained settings, including during cyclic dynamic and stepwise isometric bicep curls. Finally, we integrated the sensor into a lower-extremity robotic exoskeleton and recorded pad pressure during repeated squats with the device unpowered. Pad pressure tracked squat phase and overall task dynamics consistently. Collectively, our preliminary results suggest fluidic innervation is a readily customizable sensing modality with high signal-to-noise ratio and temporal resolution for capturing human-machine interaction. In the long-term, this modality may provide an alternative real-time sensing input to control / optimize wearable robotic systems and to capture user function during device use.
comment: 6 pages, 7 figures, accepted to BioRob 2026
Phantom: Training Robots Without Robots Using Only Human Videos
Training general-purpose robots requires learning from large and diverse data sources. Current approaches rely heavily on teleoperated demonstrations which are difficult to scale. We present a scalable framework for training manipulation policies directly from human video demonstrations, requiring no robot data. Our method converts human demonstrations into robot-compatible observation-action pairs using hand pose estimation and visual data editing. We inpaint the human arm and overlay a rendered robot to align the visual domains. This enables zero-shot deployment on real hardware without any fine-tuning. We demonstrate strong success rates-up to 92%-on a range of tasks including deformable object manipulation, multi-object sweeping, and insertion. Our approach generalizes to novel environments and supports closed-loop execution. By demonstrating that effective policies can be trained using only human videos, our method broadens the path to scalable robot learning.
comment: Project website at https://phantom-human-videos.github.io
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently across time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across both simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at https://gem-4d.github.io/.
comment: Robotic World Model, Video Generative Model
SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs in 0.12--0.30 seconds per scene across standard benchmarks, 2--3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21$\times$ higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU$>$0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8$\times$ improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
comment: Project page: https://nvlabs.github.io/SpaCeFormer/
TAGA: A Tangent-Based Reactive Approach for Socially Compliant Robot Navigation Around Human Groups
Robots navigating human-populated environments must avoid collisions while respecting the social structure of crowds, particularly the implicit boundaries of social groups. Most navigation approaches model humans as independent individuals,causing socially disruptive behavior even when collision-free. This paper presents TAGA (Tangent Action for Group Avoidance), detected group boundaries via tangent-path maneuvers without modifying the underlying navigation policy. A hierarchical safety controller coordinates group-level avoidance with individual collision prevention. We propose the Group Crossing Rate (GCR), a continuous metric measuring the fraction of timesteps the robot spends inside any group convex hull, providing finer-grained social compliance assessment than terminal metrics alone. We introduce a realistic crowd simulation benchmark with five empirically grounded phases: individual speed heterogeneity, group speed coupling, F-formation static groups, leader-follower dynamics, and convex-hull boundaries, evaluated under both ORCA and Social Force pedestrian dynamics. Experiments across ORCA, Social Force, DS-RNN, and Intention-RL reveal a reactive-learning asymmetry: TAGA provides the largest gains for classical reactive baselines (up to +8pp success rate, GCR halved) with near-zero cost for learned policies. These findings offer actionable guidance for when modular group-awareness adds value versus when end-to-end group-aware training is preferable.
comment: 8 pages, 3 figures, 3 tables. Submitted to IEEE Robotics and Automation Letters (RA-L)
Multi-Robot Box Transport over Different Surfaces with Decentralized Role-based Proportional Control
Collaborative transport of objects via pushing by multiple robots has many applications, ranging from construction and warehouse environments to post disaster debris clean-up. Achieving collaborative transport over surfaces with different inclination and friction properties however poses unique challenges. To address these challenges, this paper presents an asynchronous decentralized task and motion planning approach for transporting rectangular boxes of varying mass over flat, uphill and downhill terrain. Such a decentralized approach alleviates communication, synchronization and consensus needs and mitigates single point of failure issues. Our approach, called R2P2 or Roles with Rules and Proportional-control Primitive, assigns roles (e.g., push, support and prevent) to robots based on rules cognizant of the mode of manipulation needed (box rotation vs translation); this is followed by either rule-based control or proportional control of robot velocity based on the roles. Each robot is assumed to observe the location and heading of self and the box in executing the role and controls. R2P2 is evaluated with a six-robot team deployed in a simulator built using NVIDIA IsaacSim -- demonstrating generalizability across different surface friction/inclination and box mass scenarios, and better success rate compared to a standard virtual-leader-follower method. R2P2 is also successfully validated with a physical experiment, where it is executed onboard four turtlebots tasked with moving a 1.2 kg box.
comment: Accepted for presentation at the 2026 ASME IDETC-CIE
VR-DAgger: Immersive VR for Dexterous Data Collection and Uncertainty-Guided On-Policy Correction
Learning from demonstrations is effective for robotic manipulation, but collecting sufficient task-specific data remains a major bottleneck. Under distribution shift, small errors compound, performance degrades, and expert time is often spent on redundant, low-value corrections instead of the few critical failure cases. We present VR-DAgger, a human-in-the-loop framework centered on an immersive VR application for dexterous teleoperation, demonstration collection, and selective policy correction. The VR client provides intuitive hand control with synchronized scene visualization, while a backend workstation runs simulation and learning, enabling autonomous rollouts without continuous operator oversight. We use Monte Carlo (MC) dropout to score uncertainty during Isaac Lab rollouts of a diffusion policy and select informative failure segments for correction. These segments are replayed in VR as clips, where the operator selectively labels and corrects the policy's behavior, concentrating supervision where uncertainty is highest without full-rollout monitoring or a separate intervention classifier. We evaluate on three dexterous manipulation tasks (Pan pick-and-place, Drawer opening, Valve turning) with a 10-DoF XHand under standard and challenging initial configurations. Active labeling consistently improves over behavioral cloning across all tasks, with gains of up to 23 percentage points. Compared to unguided human-in-the-loop inspection, VR-DAgger reduces per-sample collection time by approximately 40% by focusing review on selected segments rather than full rollouts.
Multiagent Systems
SpecBench: Evaluating Specification-Level Reasoning for Software Engineering LLM Agents
Software engineering (SWE) agents are transitioning from code generation to full software development lifecycle automation. A critical phase in this lifecycle is specification design: transforming initial proposals into carefully considered requirements through expert review. Existing benchmarks such as SWE-Bench are implementation-focused by measuring the agent's ability to generate code given fixed, precise design requirements. This formulation assumes specifications are correct and complete. In real-world complex and critical software systems, initial specifications are often incomplete and flawed, requiring extensive expert reviews and revisions before being accepted for implementation. To fill this gap, we introduce SpecBench to evaluate specification-level reasoning: the ability to generate complete, unambiguous, consistent, and correct system specifications. SpecBench tasks are derived from the Request for Comments (RFC) process used by mature open-source projects. For each task, an agent is given an initial design proposal, the project codebase, and all past project RFC discussions. The agent is tasked with identifying specification deficiencies: omissions, ambiguities, inconsistencies, or incorrect assumptions in the initial proposal. We evaluate predictions against critiques raised by expert maintainers during historical RFC reviews. SpecBench contains tasks from 5 diverse repositories: Kubernetes, React, Rust, TVM, and vLLM. We evaluate state-of-the-art SWE agents on SpecBench, analyzing their capacity to reason about system design without execution feedback. The best performing agent, GPT-5.4, achieves 44.4% accuracy.
EASE Configuration Facilitates A Reproducible Science of LLM Social Simulations NeurIPS 2026
LLMs are increasingly deployed to simulate social interactions, yet many of the existing simulators remain ad hoc and monolithic. This lack of architectural standardization prevents reproducible research and complicates downstream evaluation. We advance a rigorous science of LLM-based multi-agent simulation by modularizing core components into Environments, Agents, Simulation engines, and Evaluation metrics (EASE). We demonstrate the utility of EASE configuration by wrapping it in an experimental study schema for orchestrating workflows centered around answering explicit research questions in generated scenarios. We contribute SiliSocS, an open-source, research-ready Silicon Society Sandbox implementing a study-structured EASE configuration to enable highly configurable and reproducible LLM-based social simulations. Using SiliSocS and EASE, we present three case studies, showcasing the system's comprehensive assessment of existing questions, ability to dive deeper into complex questions, and elaboration of existing studies, respectively. Together, these case studies highlight the limitations of current modeling approaches and isolate the impacts of design choices on key results.
comment: 22 pages, 5 figures, under review at NeurIPS 2026
Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization
While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizing their dynamics remains a formidable challenge due to the discrete, non-differentiable nature of the computation graph and the sparsity of global supervisory signals. Existing black-box optimizers struggle to attribute trajectory-level failure to specific local components, resulting in inefficient, high-variance exploration. We argue that tractable MAS optimization needs structural inductive biases to disentangle error signals. We propose temporal and structural credit assignment, which decomposes the objective along two axes: (i) temporal credit, using state-space bottlenecks to identify critical rounds, and (ii) structural credit, using stationary role policies to isolate agent contributions. Leveraging these decomposed signals, we introduce a discrete, verbalized block coordinate descent algorithm for iterative refinement. Rather than indiscriminate global updates, it alternates between optimizing role prompts and aggregation protocols, using LLM-generated "proxy gradients" to target only the identified weak links. Across diverse reasoning benchmarks, our approach substantially reduces query complexity while improving performance, providing a principled and interpretable path toward self-improving MAS.
comment: 15 pages, 4 figures, 6 tables
Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms
As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can you use to decide whether to trust an unfamiliar agent in the wild and delegate to it? A natural governance intuition is to extend human identity verification and reputation mechanisms, from ``Know Your Customer'' and credit scores to ``Know Your Agent'' regimes. However, we argue that this analogy is fundamentally incomplete. Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility. Yet language model agents are ontologically \emph{dissociative}: they are essentially an assemblage of mutable modules -- foundational models, system prompts, tool-access policies, external memory, and, in some cases, a multi-agent system as a whole -- any of which may change agent behavior -- with a fluid persona that is also vulnerable to adversarial attack and may not internalize sanctions. Drawing on dissociative identity disorder jurisprudence, this dissociativity leaves agents without grounding for identifiability, predictability, credibility, and rehabilitability -- the very properties that reputation mechanisms aim to sustain -- thereby collapsing trust. We argue that identity-based, ex post, regulative, sanction-based governance, such as reputation, is structurally inapplicable to dissociative agents, and we suggest a shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.
comment: Accepted at FaccT 2026
AgentSchool: An LLM-Powered Multi-Agent Simulation for Education
Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.
comment: 39 pages, 10 figures
When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems ICML 2026
The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.
comment: 30 pages, 16 figures. Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026
Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas
We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\mathcal{R}$ (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy-synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand-designed baselines, sharply tightens run-to-run variance, and outperforms prompt-only optimization. The discovered pipelines are objective-dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective-agnostic system prompt and from every efficiency-optimized pipeline. This supports an information-design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch-social-dilemmas.
comment: Accepted to the AI Agents for Discovery in the Wild (AID-Wild) Workshop at ACM CAIS 2026
On the Geometry of Games and their Solvers
A central challenge in game theory and learning systems such as GANs is understanding which algorithms can efficiently compute equilibria across the heterogeneous landscape of games. Equilibrium computation is typically studied solver by solver and game class by game class, yielding strong local guarantees but a fragmented view of solver behaviour. Existing discrete taxonomies often provide an incomplete account of where algorithms succeed. We study this problem through a solver-game map linking games to effective solver dynamics. Classical theory identifies isolated regions of this map but provides limited insight into intermediate or overlapping regimes, suggesting that solvability is governed by latent structural properties defining a continuous solver-aligned geometry of games. We formalise this perspective through structure-aware solver synthesis. A learned structure recogniser maps each game to a low-dimensional solver-aligned representation, and a policy maps this representation to effective primitive mechanisms, adapting solver behaviour across regimes. This reveals regions where particular solver dynamics are effective and where mixtures of primitives are required rather than a single dominant solver. A bounded residual acts as a local corrector and diagnostic signal for incomplete solver bases or representations. The framework yields both an adaptive solver and an analytical lens: games with similar optimisation dynamics cluster together, revealing continuous regions of algorithmic validity and overlapping solver behaviour. Empirically, we show that fixed primitives exhibit systematic regime mismatch, while the learned representation organises game space into a structured cartography aligned with solver behaviour. These results suggest viewing equilibrium computation as the joint problem of learning solver mechanisms and mapping the geometry of solvability.
Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension
Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): nine of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Claude Sonnet 4.6 Refine achieves the highest ICD in the dataset (0.913), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is approximately 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.
comment: 10 pages, 3 figures, 8 tables. Extends Willis et al. (arXiv:2501.16173). Code and n=500 replication package: https://github.com/arqFranciscoLeon/evollm (archived: https://doi.org/10.5281/zenodo.20248615)
Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems
LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience-driven MAS evolution, where a system improves based on its own execution experience. Yet such evolution is challenging because MAS experience is prolonged and intricate, interleaving multiple agents' execution chains and communication messages, which makes it difficult to identify what should be improved. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization. Across six long-horizon agent benchmarks, Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods; further analyses demonstrate that Meta-Team enables more reliable and scalable MAS self-evolution.
Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence ICML 2026
The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.
comment: Accepted at ICML 2026. 12 pages main text, 16 pages appendix
AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning
Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR, as simple single-pass strategies encode multiple videos into a shared compressed context, potentially obscuring rare but critical evidence. In this paper, we propose AgentCVR, a multi-agent framework that treats CVR as an active evidence-acquisition task. AgentCVR employs a Master Agent to iteratively coordinate specialized Visual and Audio Agents for targeted evidence extraction. To ensure efficient training, we introduce Script-Simulated RL, which optimizes the agent's policy with LLM-generated semantic scripts and a lightweight text-based simulator, bypassing costly multimodal inference during online exploration. Experimental results on a comprehensive CVR benchmark show that AgentCVR outperforms single-pass baselines and achieves comparable performance to state-of-the-art closed-source systems, particularly in complex cross-video alignment and localization. To ensure reproducibility, our code is available at https://github.com/wang-jh24/AgentCVR.
CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems
Although large language model (LLM) based multi-agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy communication between agents. Previous research has made efforts to train a sparse multi-agent graph or fine-tune a planner to orchestrate the workflow better. However, such extra training processes introduce computational costs and limit MAS to specific domains, therefore compromising their generalizability. In this paper, we propose CONCAT, a training-free multi-agent collaboration framework based on CONsensus and Confidence-driven Ad hoc Teaming to efficiently organize agent interactions. Specifically, agents are clustered based on their initial answers, and leaders of each cluster are selected based on the agents' confidence. Then, a heuristic function based on the Theory of Mind is designed to predict the collaboration benefits between every two leaders according to their answers and confidence. Finally, an ad hoc multi-agent network is organized after evicting a percentage of communications based on the predicted benefits. Experiments across three LLMs and three benchmarks show that CONCAT achieves up to 2.02x higher efficiency (accuracy/latency ratio) than LLM-Debate and outperforms training-aware methods such as AgentDropout, while reducing average latency by 50.1% on Qwen2.5-14B-Instruct, without any task-specific training.
DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration
Tackling complex reasoning tasks typically relies on massive monolithic LLMs, which suffer from severe computational redundancy. While task decomposition through structured pipelines or multi-agent collaborations offers an alternative, these approaches inevitably fall into a critical dilemma: predefined static topologies are highly vulnerable to cascading errors, whereas unconstrained dynamic agents suffer from trajectory divergence and unpredictable memory bloat. To address this, we present DynaGraph, a lightweight multi-model framework driven by dynamic topological reconfiguration. At the execution level, DynaGraph multiplexes time-division PEFT adapters over a shared base model, enabling both full system training and inference deployment on a single consumer-grade GPU. At the routing level, the Evaluator continuously monitors execution confidence to trigger hierarchical self-healing: Fine-grained Patching for localized data gaps and Subgraph Reconstruction for severe logical ruptures. Experiments on StrategyQA, MATH, and FinQA demonstrate our 8B model closely approximates the reasoning capabilities of a 72B monolithic model (e.g., 87.6% on StrategyQA, 82.7% on MATH). Furthermore, it reduces latency by up to 68.1% and token consumption by 68.6% compared to unconstrained dynamic architectures.
LLM-ALSO: LLM-Driven Adaptive Learning-Signal Optimization for Multi-Agent Reinforcement Learning
Effective training-time guidance is central to multi-agent reinforcement learning (MARL), yet remains difficult in sparse-reward settings where weak supervision limits coordination and policy improvement, and existing methods often require substantial domain expertise or manual design effort. Large language models (LLMs) provide a promising alternative for flexible learning-signal design, yet existing LLM-based methods remain largely single-agent-oriented, one-shot, or weakly validated for the evolving training dynamics of cooperative MARL. To address these limitations, we propose LLM-ALSO, an iterative LLM-driven adaptive learning-signal optimization framework for MARL. Rather than directly deploying LLM-generated rewards, LLM-ALSO decomposes adaptation into iterative diagnosis, proposal, and validation: a Critic LLM diagnoses stage-specific learning and coordination failures from sparse-return metrics and compact behavior evidence, a Generator LLM proposes candidate reward-shaping configurations conditioned on the diagnosis, and branch-validation feedback refines candidates before they affect the main training trajectory. Through short-horizon validation and stage-aware adaptation, LLM-ALSO promotes only validated updates into training, reducing the risk of unreliable LLM-generated modifications. Experiments on sparse-reward cooperative MARL tasks show that LLM-ALSO improves sparse-evaluation performance and learning efficiency.
comment: 14 pages, 6 figures, 6 tables
MATraM: A Multi-Activity Transport and Mobility Agent-Based Model for Activity Modifications
This paper introduces the Multi-Activity Transport & Mobility (MATraM) Agent-Based Model (ABM), a novel framework designed to advance activity-based transport modelling by incorporating dynamic activity adaptation. Traditional transport models simulate system performance using varying levels of abstraction, including flow-based, queue-based, and interaction-based mobility representations. While these approaches differ in their treatment of movement and congestion, they typically rely on pre-defined trip patterns that limit responsiveness to changing conditions. In particular, conventional activity-based models generate trips from fixed daily schedules, constraining their ability to capture behavioural flexibility and uncertainty. MATraM addresses this limitation by enabling agents to flag activities modification requests in response to sub-optimal travel conditions, such as increased travel times. By coupling with an activity scheduling and modification framework, the model integrates adaptive decision-making into the generation and execution of daily activity schedules. This allows for a more realistic representation of how individuals adjust their behaviour in response to transport system dynamics, leading to emergent mobility and congestion patterns. The ABM is presented following the ODD protocol, outlining its purpose, structure, and implementation. MATraM includes detailed representations of agents, their activity schedules, and the transport network, alongside submodels governing routing, scheduling, and behavioural adaptation. By bridging activity-based modelling with interaction-based mobility simulation, MATraM provides a flexible and extensible platform for exploring transport dynamics under uncertainty. This work contributes to the development of next-generation transport models capable of capturing the complex interplay between individual behaviour and system-level outcomes.
comment: 24 pages, 4 figures, 9 tables, working paper for a submission to MethodsX journal
A Theory-Guided LLM Pedagogical Agent for STEM+C Scaffolding Without Over-Reliance
LLM pedagogical agents are proliferating, yet recent findings have raised questions about their adherence to established theories of learning and, by extension, their educational value. Concerns regarding cognitive offloading, over-reliance, and "gaming" behaviors persist and remain largely unaddressed. In response, we developed Copa, an agentic, multi-agent, multimodal Collaborative Peer Agent for STEM+C learning. Copa is built on top of the Evidence-Decision-Feedback (EDF) framework, grounding its interactions in Social Cognitive Theory and Social Constructivism and promoting sense-making through adaptive, dialogic support rather than answer-seeking. In an authentic high school computational-modeling study (n=33 dyads), we demonstrate that Copa (1) supports students' confidence building and ability to verbalize conceptual understanding without causing dependence; and (2) provides adaptive feedback personalized to learners that is interpretable with respect to students' multimodal input data. These findings position theory-guided, multimodal LLM agents as a promising path toward classroom AI integration that amplifies students' reasoning rather than replacing it.
comment: Submitted to Computers & Education. Currently under review
LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52%--69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
comment: Ongoing work
Delayed Repression and Emergent Instability in Adaptive Multi-Agent Systems
Regulatory institutions (from content moderation platforms to financial supervisors) observe, deliberate, and intervene only after a characteristic delay. We ask whether this processing lag alone can destabilize a multi-agent system that would otherwise remain stable, without exogenous shocks, coordination among agents, or malicious actors. We study this question in two stages. First, we analyze a delayed replicator equation in which autonomous agents receive a benefit from radical behavior but face punishment based on a lagged institutional alarm signal. We derive a closed-form critical delay threshold beyond which the unique interior equilibrium loses stability through a Hopf bifurcation, and prove via center manifold reduction that the bifurcation is supercritical (producing bounded oscillations, not explosive growth) for the entire sigmoid response-function family. Second, we embed $N=240$ agents on a network and equip them with reinforcement learning (tabular Q-learning), comparing three decision architectures in a factorial design: non-reactive agents (fixed policy), reactive agents (threshold heuristic without memory), and Q-learning agents (adaptive with cumulative value estimates). The results reveal a hierarchy opposite to the naive expectation that learning amplifies instability: non-reactive agents are immune to delay (0% runaway across all tested values), reactive agents collapse catastrophically (96% runaway by delay $\geq 8$ steps), and Q-learning agents achieve partial resilience (66% runaway at delay $= 20$). The destabilizing ingredient is reactivity to delayed signals: agents that immediately exploit low-alarm windows trigger oscillatory feedback loops. Learning buffers this through implicit punishment memory encoded in Q-values
comment: 30 pages, 10 figures, 2 appendices. Code: https://github.com/YehudaItkin/delayed-repression-instability
Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate
Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial discourse, a framework known as the Argumentative Theory of Reasoning (ATR). Rather than relying on individual "intellectualist reasoners" as the primary vehicle for truth-seeking, ATR reconceptualises truth as an emergent property of social epistemology: the product of imperfect individual reasoning refined under the adversarial pressure of debate. This distributed method of collective intelligence has guided humanity to ever-greater epistemic heights and underpins the foundational principles of all democratic systems. This thesis breaks new ground by, for the first time, simulating ATR through the multi-agent debate (MAD) of large language models (LLMs). With rigorous empirical analysis, we demonstrate that, when correctly engineering an epistemically diverse set of models, LLM-MAD can significantly improve truth-seeking performance on questionnaire-based tasks, even when individual debate participants exhibit limited standalone performance. Furthermore, we present strong empirical evidence that this performance gain is mechanistically grounded in the central principles of ATR, suggesting that collective reasoning may be universally favourable over individualist reasoning, rather than a quirk in biology or evolution. Finally, drawing on our analysis of debate dynamics, we propose a novel benchmarking methodology that leverages LLM-MAD to measure intrinsic model properties (such as hallucination propensity) in order to compare models in ways that current static benchmarking approaches cannot support.
comment: Master's thesis
CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs
Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants while deciding what to reveal about the person it represents. We introduce CalBench, a controlled benchmark for multi-agent calendar scheduling under private information. In each task, $N$ agents manage separate private calendars and schedule a stream of $M$ incoming meetings while minimizing disruption costs. Because no agent can inspect another agent's calendar, success requires language-mediated coordination rather than centralized planning. CalBench generates solvable scenarios with CP-SAT oracle solutions and decentralized non-LLM reference protocols, enabling evaluation of task success, excess cost, communication efficiency, burden fairness, and privacy leakage under matched information constraints. Across seven model families, we find that completion alone misses important failures: agents leave avoidable cost on the table, communication volume does not predict lower regret, and privacy-preserving silence can deprive teammates of cost information needed for fair burden allocation. CalBench provides a reproducible testbed for studying whether autonomous assistants can coordinate on behalf of users before deployment at scale.
KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition
KYA (Know Your Agents) is an open-source, framework-agnostic trust and governance layer for autonomous systems, composed of five primitives: (1) a four-gate inbound apply pipeline; (2) an only-tighten composition algebra over a three-channel multi-tenant hierarchy; (3) KYP (Know Your Principal), a schema-level unification of trust scoring across human users, AI agents, and service accounts; (4) auditable interaction-multiplier amplification over an AIVSS-shaped additive baseline; and (5) two-axis delegation attribution: a static premium for risky delegates and a runtime debit for actual delegate misbehavior in multi-agent fan-out. Together these span three pillars (trust, governance, and evidentiary assurance), making an autonomous system's actions authorized, policy-conforming, and post-hoc verifiable: where observability answers how long, how much, and what path, KYA answers was it authorized, did it conform, and can it be verified; it composes with observability rather than replacing it. It ships native adapters for 15+ agent frameworks. On a 4 by 9 cross-backend matrix all 36 cells pass; the pure-function scorer runs sub-millisecond at p99 and the system sustains ~ 1,800 ops/sec at 20 concurrent workers with HMAC chain integrity preserved end-to-end. KYA detects 89% of 1,200 adversarial probes from PyRIT and Garak, including the recently-published topology-guided multi-agent attack. The system is available under Apache 2.0 as the veldt-kya package on PyPI.
comment: 26 pages including appendix. Code available under Apache 2.0 at https://github.com/veldtlabs/veldt-kya (pip install veldt-kya). Two-domain worked examples (loan decisioning under NYDFS/ECOA/CFPB; clinical triage under HIPAA/21 CFR Part 11/FDA SaMD).Reproducibility artifacts in-tree
An Agent-Centric Dynamical Systems Perspective on Multi-Agent Reinforcement Learning
Analysing learning in Multi-Agent Reinforcement Learning (MARL) environments is challenging, in particular with respect to \textit{individual} decision-making. Practitioners frequently struggle to compare training runs due to the inherent stochasticity in algorithms arising from random dithering exploration, environment transition noise, and stochastic gradient updates to name a few. Traditional analytical approaches, such as replicator dynamics, oft rely on mean-field approximations to remove stochastic effects, but this simplification, whilst able to provide general overall trends, can lead to dissonance between analytical predictions and actual agent realisations. We propose modelling MARL training as a \textit{coupled stochastic dynamical systems}, capturing both agent interactions and environmental characteristics. Leveraging tools from dynamical systems theory, we pragmatically analyse the stability and sensitivity of agent behaviour, which are key dimensions for their practical deployments, for example, in presence of strict safety requirements. This framework allows us to rigorously study the inherent stochasticity of MARL, providing a deeper understanding of system behaviour.
SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems ICLR 2026
Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Semantic-Consistent Opinion Pooling), a training-free uncertainty quantification (UQ) framework for multi-VLM systems through uncertainty-weighted linear opinion pooling. The core idea is to treat each VLM as a probabilistic "expert," sample multiple outputs, map them to a unified space, aggregate their opinions, and produce a system-level uncertainty score. Unlike prior UQ methods designed for single models, SCoOP explicitly measures collective, system-level uncertainty across multiple VLMs, enabling effective hallucination detection and abstention for highly uncertain samples. On ScienceQA, SCoOP achieves an AUROC of 0.866 for hallucination detection, outperforming baselines (0.732-0.757) by approximately 10-13%. For abstention, it attains an AURAC of 0.907, exceeding baselines (0.818-0.840) by 7-9%. Despite these gains, SCoOP introduces only microsecond-level aggregation overhead relative to the baselines, which is trivial compared to typical VLM inference time (on the order of seconds). These results demonstrate that SCoOP provides an efficient and principled mechanism for uncertainty-aware aggregation, advancing the reliability of multimodal AI systems. Our code is publicly available at https://github.com/chungenyu6/SCoOP.
comment: Accepted to ICLR 2026 Workshop on Agentic AI in the Wild: From Hallucinations to Reliable Autonomy
Scaling Small Agents Through Strategy Auctions ICML 2026
Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents' performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 52%, lowers overall cost by 35%, and consistently improves upon the largest agent's pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost, often both, underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively "scaled up" through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.
comment: ICML 2026
ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling
Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans, where only one arm is moving at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that's a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.
comment: Project website: https://schedulestream.github.io
Understanding Social-Force Model in Psychological Principles of Collective Behavior
To well understand crowd behavior, microscopic models have been developed in recent decades, in which an individual's behavioral/psychological status can be modeled and simulated. A well-known model is the social-force model innovated by physical scientists (Helbing and Molnar, 1995; Helbing, Farkas and Vicsek, 2000; Helbing et al., 2002). This model has been widely accepted and mainly used in simulation of crowd evacuation in the past decade. A problem, however, is that the testing results of the model were not explained in consistency with the psychological findings, resulting in misunderstanding of the model by psychologists. This paper will bridge the gap between psychological studies and physical explanation about this model. We reinterpret this physics-based model from a psychological perspective, clarifying that the model is consistent with psychological theories on stress, including time-related stress and interpersonal stress. Based on the conception of stress, we renew the model at both micro-and-macro level, referring to multi-agent simulation in a microscopic sense and fluid-based analysis in a macroscopic sense. The cognition and behavior of individual agents are critically modeled as response to environmental stimuli. Existing simulation results such as faster-is-slower effect will be reinterpreted by Yerkes-Dodson law, and herding and grouping effect as well as oscillation phenomenon are further discussed for pedestrian crowd. In brief the social-force model exhibits a bridge between the physics laws and psychological principles regarding crowd motion, and this paper will renew and reinterpret the model on the foundation of psychological studies.
comment: 37 pages, 20 figure
ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows
While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers ``retrieval sycophancy,'' where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.
comment: CVR 2026
ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents
Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies as well as analyzed failure modes of agentic systems on SWE tasks, focusing on several contextual information signals: Reproduction Test, Regression Test, Edit Location, Execution Context, and API Usage. However, the individual contribution of each signal to overall success remains underexplored, particularly their ideal contribution when intermediate information is perfectly obtained. To address this gap, we introduce Oracle-SWE, a unified method to isolate and extract oracle information signals from SWE benchmarks and quantify the impact of each signal on agent performance. To further validate the pattern, we evaluate the performance gain of signals extracted by strong LMs when provided to a base agent, approximating real-world task-resolution settings. These evaluations aim to guide research prioritization for autonomous coding systems.
comment: Under peer review; 37 pages, 10 figures, 5 tables
Approximate Proportionality in Online Fair Division ICML
We study the online fair division problem, where indivisible goods arrive sequentially and must be allocated immediately and irrevocably. Prior work establishes strong impossibility results for approximating classic notions such as envy-freeness up to one good (EF1) and maximin share (MMS) in this setting, but the approximability of proportionality up to one good (PROP1) has remained unresolved. We resolve this gap in two steps. First, we show that three natural greedy allocation rules (standard baselines in fair division) fail to guarantee any multiplicative approximation to PROP1 against an adaptive adversary. These limitations motivate two relaxations: (i) restricting attention to a non-adaptive adversary, and (ii) incorporating coarse predictions in the spirit of learning-augmented algorithms. Under a non-adaptive adversary, we show that the uniform random allocation achieves a meaningful PROP1 approximation with high probability, and this guarantee is essentially tight for this approach; moreover, when item values are sufficiently small, the allocation is near-PROP1 with high probability. Finally, given maximum item value (MIV) predictions, we design an online algorithm that achieves robust approximation guarantees for PROP1, and degrades gracefully under one-sided prediction error. In contrast, we show that EF1, MMS, and PROPX remain inapproximable even with perfect MIV predictions.
comment: Appears in the 43rd International Conference on Machine Learning (ICML), 2026
ValueFlow: Measuring the Propagation of Value Perturbations in Multi-Agent LLM Systems
Multi-agent large language model (LLM) systems increasingly consist of agents that observe and respond to one another's outputs. While value alignment is typically evaluated for isolated models, how value perturbations propagate through agent interactions remains poorly understood. We present ValueFlow, a perturbation-based framework that measures value drift in multi-agent systems via a 56-value valuation dataset derived from the Schwartz Value Survey, with agent value orientations scored using an LLM-as-a-judge protocol. ValueFlow decomposes value drift into agent-level response behavior and system-level structural effects, captured by two metrics: \b{eta}-susceptibility, an agent's sensitivity to perturbed peer value signals, and system susceptibility (SS), the effect of node-level perturbations on final system outputs.Experiments span across value dimensions, backbones, personas, and topologies, showing that susceptibility varies sharply across values and is strongly shaped by interaction structure, indicating that value alignment in multi-agent systems is a system-level property, not just an agent-level one. ValueFlow thus provides a principled basis for auditing and mitigating value propagation in deployed multi-agent systems.
comment: Preprint. Under review. 28 pages, 10 figures
Offline Multi-agent Reinforcement Learning via Sequential Score Decomposition ICML 2026
Offline cooperative multi-agent reinforcement learning (MARL) faces unique challenges due to distributional shifts, particularly stemming from the high dimensionality of joint action spaces and the presence of out-of-distribution joint action selections. In this work, we highlight that a fundamental challenge in offline MARL arises from the multi-equilibrium nature of cooperative tasks, which induces a highly multimodal joint behavior policy space coupled with heterogeneous-quality behavior data. This makes it difficult for individual policy regularization to align with a consistent coordination pattern, leading to the policy distribution shift problems. To tackle this challenge, we design a sequential score function decomposition method that distills per-agent regularization signals from the joint behavior policy, which induces coordinated modality selection under decentralized execution constraints. Then we leverage a flexible diffusion-based generative model to learn these score functions from multimodal offline data, and integrate them into joint-action critics to guide policy updates toward high-reward, in-distribution regions under a shared team reward. Our approach achieves state-of-the-art performance across multiple particle environments and Multi-agent MuJoCo benchmarks consistently. To the best of our knowledge, this is the first work to explicitly address the distributional gap between offline and online MARL, paving the way for more generalizable offline policy-based MARL methods.
comment: ICML 2026 Accepted
The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More
Developers and consumers increasingly choose reasoning models (RMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RMs across 12 diverse tasks covering competition math, science QA, code generation, and multi-domain agents. We uncover the pricing reversal phenomenon: in 32% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 80% cheaper than GPT-5.4's, yet its actual cost across all tasks is 38% higher. We build a formal cost attribution framework based on Shapley value, and leverage it to trace the dominating contributors to vast heterogeneity in thinking token consumption and number of interaction turns: on the same query, one model may use 900% more thinking tokens than another, or 10x more turns of environment interactions. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Thus, we propose cost distribution prediction as an open challenge. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
Multi-Agent Teams Hold Experts Back ICML 2026
Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that -- unlike human teams -- LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 41.1% on ML benchmarks. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise -- averaging expert and non-expert views rather than appropriately weighting expertise -- which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.
comment: Accepted at the International Conference on Machine Learning (ICML 2026)
Systems and Control (EESS)
Optimization of Predictive Maintenance Schedules under Uncertainty: A Scenario-Based Theoretical Framework
This paper proposes a scenario-based framework for predictive maintenance scheduling under uncertainty in a finite planning horizon. The considered setting involves multiple assets for which maintenance decisions are informed by three heterogeneous sources of information: calendar-based overhaul intervals, usage-based limits driven by uncertain future operating cycles, and condition-monitoring outputs represented through remaining useful life (RUL) estimates with uncertainty. While these elements have been studied extensively in the maintenance literature, they are often treated separately or only partially integrated. In contrast, the proposed formulation evaluates complete maintenance schedules under simulated future scenarios and compares them using expected-cost and tail-risk criteria. The contribution is primarily conceptual and methodological: we define a unified finite-horizon decision framework that combines calendar-, usage-, and prognostics-based information within a common scheduling problem. A small synthetic computational example is used as a proof of concept. The results show that integrated scenario-based policies can substantially outperform simpler single-trigger rules, while the difference between risk-neutral and risk-aware integrated policies remains modest under the present calibration.
comment: This work has been submitted to the IEEE for possible publication
Fault-Ride-Through Coordination Strategy for Offshore AC Islands with Multi-Infeed HVDC Interconnections
Large-scale offshore Wind Farms (WFs) are considered key assets towards realizing a sustainable power system. These systems are often configured as offshore AC islands and their integration largely depends on the High-Voltage-Direct-Current (HVDC) technology. This topology, while it enables cost-effective transmission over large offshore distances, may lead to operational challenges. Specifically, the operation of offshore AC islands during faults and the grid code requirement fulfillment are identified as a major challenges for their large-scale deployment. To address this pressing issue, a comprehensive coordination control strategy for the different participating converters in multi-infeed AC offshore islands during Fault Ride Through (FRT) operation is presented in this work. The proposed strategy introduces advanced control functions in the FRT schemes of both the HVDC and WF converters, such as zero active and reactive power injection during faults, as well as post-fault active power droop control coordination to tackle power imbalances. The proposed FRT coordination strategy is validated through both extensive simulations in PSCAD/EMTDC, as well as with Power Hardware-in-the-Loop (PHIL) experimental results, considering both AC and DC faults.
BuilDyn: Excitation-Driven Data Generation for Building Thermal Dynamics Modeling and Control
Machine learning (ML) is increasingly used for data-driven modeling of buildings to enable downstream tasks such as fault detection and diagnosis, and energy-efficient control. While recent work improves generalization across building characteristics, weather, and occupancy, generalization also depends on sufficient exploration of the control-driven system state space. Existing real-world datasets and simulation environments predominantly reflect stationary operation under fixed control policies, resulting in limited excitation and reduced robustness to unseen operating conditions. This paper introduces BuilDyn, a package based on BuilDa that enables customizable excitation strategies for control-oriented data generation. BuilDyn further supports sampling from representative building distributions and provides a Python interface for easy integration into machine learning pipelines. We demonstrate the benefits of BuilDyn by comparing the performance of data-driven ML models trained on non-excited and excited data for one building. With BuilDyn, we hope to advance scalable control-oriented modeling and support future directions such as transfer learning and building-specific foundation models.
Distributed Nonlinear Model Predictive Control for District Heating Networks
This paper presents a distributed nonlinear model predictive control that uses alternating direction method of mul tipliers for district heating networks. Exploiting a graph-based modeling of the thermal dynamics, our controller optimizes the mass flow absorption of buildings in a distributed cooperative scheme that mediates between the superior performance of the centralized control and the privacy preservation of the decentralized schemes. A benchmark three-building network simulation is used to compare the performance of the proposed solution with a decentralized model predictive control scheme.
comment: 9 pages, 9 figures
Teleoperation Operational Design Domain based on Minimal Risk Maneuver Capability
This article discusses the concept of an Operational Design Domain (ODD) designed specifically for teleoperated road vehicles. For this purpose, the ODD concept designed for automated driving is adapted for teleoperation. As teleoperation becomes more common in regular traffic, the question arises under which operating conditions such vehicles are able and allowed to drive. Currently, these conditions are selected primarily based on network performance. From a safety perspective, it is difficult to base such a selection on a reliable connection because it is almost impossible to guarantee sufficient reliability. With this in mind, the ODD concept designed for automated driving is adapted for teleoperation: A concept is proposed for basing the ODD for a teleoperation system on the capability of the teleoperated vehicle to perform a minimal risk maneuver using a dedicated system designed solely for this purpose. This concept is then demonstrated using a use case example.
comment: This is a preprint. The manuscript is under preparation and has not yet been submitted for peer review
Tackling Interference in HAPS Networks via Angular-Aware Clustering and RSMA
High Altitude Platform Stations (HAPS) have emerged as a promising enabler for next-generation wireless networks, offering ubiquitous connectivity to ground users. Operating either in standalone mode or in integration with terrestrial networks, HAPS can significantly enhance both coverage and capacity due to their strategic placement in the stratosphere. However, interference management in HAPS-empowered networks requires special attention due to the unique propagation characteristics of HAPS links. In particular, the strong line-of-sight (LoS) conditions between HAPS and ground users result in limited channel variability, thereby intensifying inter-user interference. In this work, we consider a single HAPS serving multiple ground users through multiple beams over a limited number of orthogonal resource blocks (RBs). To address the resulting interference, we propose a novel angular-aware user clustering and interference-aware RB allocation framework that strategically clusters users, designs beams to serve each cluster, and allocates RBs to users across clusters. To further mitigate intra-RB interference, a rate-splitting multiple access (RSMA) scheme is incorporated. Simulation results demonstrate that the proposed clustering and RSMA-based approach significantly outperforms baseline schemes in terms of achievable per-user spectral efficiency.
Robustness Enhancement of Consensus Networks: the Optimal Memory Depth
Understanding what governs collective robustness and how it can be enhanced remains a central pursuit in network science. This paper investigates the robustness of multi-agent consensus networks, quantified by the $H_2$ performance metric, and delves into the enhancing effect of agents' local memory on it. Inspired by the hierarchical temporal structure of memory observed in neuroscience, we focus on the role of memory depth, which reflects the temporal features of memory from recent to remote. Building on linear extrapolation, we propose a consensus protocol with single-step memory and tunable memory depth, derive the necessary and sufficient condition for achieving consensus, and show that the protocol exhibits an inheritable consensus property across memory depths. Furthermore, analytical expressions for the $H_2$ performance metric, which depend on the memory factor, memory depth, coupling gain, and Laplacian spectrum, are established. Under balanced usage of real-time and memory information, we demonstrate that memory at any accessible depth enhances $H_2$ performance, and the optimal memory depth occurs at either the most recent or the most remote memory, contingent upon certain parameter regions. Further detailed discussions are provided to clarify the broader implications of our findings.
comment: 19 pages, 5 figures
Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging ICML 2026
Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale the limiting resource is often the set of expert weights that must be read. We introduce MergePipe, a budget-aware execution layer that casts LLM merging as an \emph{expert access-set} problem: given a merge operator and a checkpoint family in a shared weight coordinate system, choose which expert delta blocks to access under an explicit I/O budget. MergePipe indexes parameter blocks, builds deterministic access plans, and executes the induced budgeted merge with replayable manifests. The plan is budget-sound by construction and recovers the full-read merge at full budget; for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of omitted deltas. Across Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves up to $11\times$ speedups. Representative budget sweeps show $O(10^{-3})$ parameter deviation from full-read merges and no monotonic degradation on downstream benchmarks.
comment: ICML 2026 Workshop on Weight-Space Symmetries: from Foundations to Practical Applications
Real-Time Retargeting Using Controllability Boundary for Chandrayaan-3 Lunar Landing
This paper presents the real-time retargeting guidance policy developed for the Chandrayaan-3 lunar landing mission. The baseline guidance generates approximate fuel-optimal descent trajectories, while a high-level policy enables safe retargeting to alternate sites when the nominal site becomes infeasible. The retargeting strategy leverages a convex representation of the controllability boundary, allowing rapid feasibility checks and real-time target updates. To the best of the authors knowledge, this represents the first application of a data-driven retargeting framework in an operational lunar landing mission. Pre-flight simulations and Chandrayaan-3 flight results validate the effectiveness of the proposed approach.
comment: 8 pages, 6 figures, Accepted for publication in American Control Conference 2026
Decoupled Thrust-Axis Attitude Control Using Quaternions for Chandrayaan-3 Lunar Landing Mission
Chandrayaan-3 mission achieved a historic milestone with its successful soft landing near the lunar south pole, highlighting the critical role of the navigation, guidance, and control (NGC) system. Navigation provided vehicle state estimates relative to the Moon center, while a polynomial based guidance scheme computed the required acceleration profile to meet terminal landing conditions. This acceleration demand was translated into total thrust magnitude and attitude commands generation. Attitude command generation involved aligning the thrust axis with the required acceleration vector and constraining rotation about the thrust axis, typically governed by mission-specific requirements. Although quaternion-based control laws are preferred for their singularity-free representation, they inherently couple all three rotational axes. This coupling can lead to undesirable interactions between guidance and control, especially during large rotations about the thrust axis, due to the quaternion shortest-path property. This paper proposes a novel quaternion-based decoupling method that enables independent thrust-axis control, mitigating guidance-control interaction and ensuring proper attitude commands generation for lander attitude control.
comment: 6 pages, 7 figures, Published in Indian Control Conference 2025
Closed-Loop Identification of Periodically Time-Varying Systems via Cyclic Reformulation
This paper studies closed-loop identification of linear periodically time-varying (LPTV) plants, with emphasis on open-loop unstable plants for which open-loop experiments are not practically available. The central contribution is an exact algebraic plant-extraction theorem for cycled closed-loop realizations: for square strictly proper plants and a controller path satisfying an invertibility condition, the cycled plant transfer matrix is recovered from a shared state-space realization of the stable closed-loop maps from the external reference to the plant output and to the control input, without state augmentation, and without requiring the recovered plant realization to be stable. Thus, the stability requirement for data generation is shifted from the open-loop plant to the internally stable closed-loop system. Building on this result, a closed-loop identification algorithm is constructed that takes the reference, output, and input signals as data, applies standard subspace identification to the cycled signals, performs the algebraic plant extraction, and recovers the LPTV plant state-space parameters via a coordinate transformation; the conditioning of the inverse controller path governs the reliability of the extraction step. Numerical examples demonstrate the recovery of stable and open-loop unstable SISO LPTV plants and validate a MIMO case through coordinate-invariant Markov-parameter comparisons.
comment: Submitted to Automatica
Iterative Reduced-Rank MMSE Estimation of Sparse Range Profiles from Non-Contiguous Radar Transmission Spectra
Ongoing demand for radio spectrum by commercial wireless services has steadily increased pressure on the frequency bands traditionally reserved for radar. This paper addresses the joint problem of designing non-contiguous radar transmission spectra and estimating the range profile from the resulting reduced measurement set. Transmission spectra are constructed using a Marginal Fisher Information (MFI) criterion that removes blocks of frequencies contributing least to estimation accuracy. To process the underdetermined signals acquired from the resulting sparse measurement vector, an iterative Reduced-Rank Minimum Mean-Square Error (RRMMSE) estimator is proposed. The estimator starts with a single-target hypothesis and grows the active target subspace one range bin at a time, updating the a~priori target covariance matrix in each iteration using both the largest estimated reflection coefficient and its posterior error variance. This avoids inversion of the full $M{\times}M$ covariance matrix that would be required by a one-step MMSE and concentrates the rank of the estimator on the support of significant scatterers. The Bayesian Cramér--Rao Lower Bound (CRLB) on the per-bin reflection coefficient is derived for the non-contiguous spectrum measurement model, and the computational complexity of the proposed estimator is shown to scale as $\Order(G^2 M K^2)$, where $G$ is the number of detectable scatterers, $M$ is the number of range bins, and $K$ is the number of preserved spectral samples. Simulations using $50\%$ and $75\%$ spectrally occupied MFI-designed spectra confirm that the algorithm recovers sparse range profiles with Mean-Square Error (MSE) close to the fully filled baseline when the number of significant scatterers is not larger than the rank of the sparse sensing matrix.
$α$-stability of Differentially Flat Systems with Application to Newton-Raphson Tracking Control for Vehicle Dynamics
This paper studies the $α$-stability property of differentially flat nonlinear dynamical systems. The results build off the recently introduced notion of $α$-stability, which is particularly amenable to characterize the ability of a system to track dynamic output reference signals. We consider systems controlled using the Newton-Raphson tracking controller, which results in closed-form control policies, therefore it is computationally efficient, and it has been shown to be effective to control a large variety of mobile robots, including autonomous vehicles. The main results of the paper consist in sufficient conditions for the $α$-stability of differentially flat systems and for the equivalence between the proposed control algorithm and the Newton-Raphson tracking controller applied directly to the nonlinear dynamics. We demonstrate the behavior of the proposed controller applied to the kinematic unicycle and dynamic bicycle models.
Distributed Non-Uniform Scaling Control of Multi-Agent Formation with Dynamic Agent Joining
Non-uniform scaling control of formation enables multi-agent systems to adjust their shape by scaling with different ratios along different coordinate axes, offering enhanced flexibility in complex environments. However, like most existing formation maneuver strategies, it typically assumes a fixed set of agents, limiting its applicability in scenarios requiring dynamic team expansion. This paper introduces a distributed control framework that enables a formation to incorporate new agents during non-uniform scaling maneuvers in arbitrary dimensions while preserving the spectral properties of the graph Laplacian. Simulation examples validate the effectiveness of the theoretical results.
comment: This paper has been accepted by IFAC 2026
ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning
Continuous control policies trained with off-policy reinforcement learning frequently exhibit high-frequency action jitter, rendering direct deployment on physical actuators impractical. Post-hoc filtering attenuates jitter but introduces phase lag; embedding smoothness penalties in the actor's loss couples them with the RL gradient and conflates reward regression with over-aggressive smoothing. We present ZAPS-DA, a framework that reduces action jitter at deployment with negligible phase lag and no post-processing. ZAPS-DA pairs an unmodified main actor (trained by the base RL loss) with a separate decoupled actor trained via supervised imitation of zero-phase filtered targets stored in the replay buffer. The deployed policy is the decoupled actor: a feed-forward map from the current observation to a smooth action, with no inference-time filter and no action-history input -- a mechanism we term causal distillation of a non-causal filter. A magnitude-matched MSE loss provides zero-hyperparameter portability across optimizer classes. Validated with Soft Actor-Critic and a Savitzky--Golay filter in two driving simulators using paired n=150 evaluation protocols: on MetaDrive, ZAPS-DA reduces steering jitter by 14--21x and throttle jitter by 3--5x (all $p < 10^{-4}$, Bonferroni-corrected) while matching task-completion (p=0.28 success, p=0.31 crash) at a 6.3% reward cost; on a custom Webots adaptive cruise control environment, the same SG configuration produces a Pareto improvement -- reward parity (p=0.121), 8--45x steering jitter reduction, and total task-failure rate reduced from 2.0% to 0.7%.
comment: 7 pages, 5 figures, 5 tables. Submitted to IEEE RA-L
Conservation-Based Feedback-Circuit Decomposition for Linear Forced Systems
We present a conservation-based feedback-circuit decomposition specifically for general linear forced systems. In a role parallel to that of eigenvalues and eigenvectors for initial-value problems, the complete set of independent intrinsic circuit gains and their associated forcing-transformation vectors provide a complete analytical representation of both transient and equilibrium forced solutions. The sign of intrinsic circuit gains determines whether successive feedback cycles exhibit monotonic or oscillatory convergence to transformed forcing, while the forcing-transformation vectors determine the structure of transformed forcing. The exact transient and equilibrium solutions are represented analytically through the convergence of the finite-cycle forcing-transformation kernel to the equilibrium forcing-transformation kernel, which is guaranteed regardless of whether the magnitudes of circuit gains exceed one or unstable modes exist in the system. The feedback-circuit decomposition provides a new generic foundational mathematical tool for understanding, predicting, and controlling forced responses in a broad range of coupled linear systems across science and engineering.
Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics
Learning to reach arbitrary goals from sparse feedback requires agents to infer a rich notion of reachability across state--goal pairs. Goal-conditioned reinforcement learning (GCRL) tackles this challenge by learning policies that generalize across goals, but this generalization becomes increasingly difficult as the underlying dynamics become high-dimensional, hybrid, or contact-dependent. To address this issue, physics-informed GCRL (Pi-GCRL) introduces optimal-control-inspired inductive biases into goal-conditioned value learning. While Pi-GCRL methods have proven effective in navigation and object-free goal-reaching domains, their reliability in contact-rich tasks remains unclear, where contact interactions induce hybrid dynamics, mode-dependent controllability, and nonsmooth value landscapes. In this work, we show that these structural properties can cause existing Pi-GCRL methods to degrade when applied naively to contact-rich manipulation. Motivated by this analysis, we introduce contact-aware and hierarchical formulations that apply physics-informed inductive biases selectively across the manipulation problem. Our results provide a principled step toward extending Pi-GCRL to contact-rich manipulation.
Resonant Method-based Fully Automated Core Loss Measurement System for Sub-MHz Magnetics With Switched Capacitor Sequence
Accurate loss characterization is essential for the design of high-frequency power magnetic components. State-of-the-art resonant characterization methods are attractive for high accuracy and low sensitivity, especially at the MHz regime. However, they predominantly rely on manual tuning and computationally intensive Fast Fourier Transform (FFT) analysis to identify resonant conditions, causing both inefficiencies and inaccuracies. To ensure accuracy and expedite the process, this paper proposes a fully automated measurement architecture, the core innovation of which lies in the integration of digitally-controlled switched capacitor sequences and onboard signal processing circuits,enabling automated sweeping of both frequency and drive level for complete and rapid characterization with no human intervention. A design guideline for the switched capacitor sequence is presented and common commercial electromechanical power relays are characterized to enable sub-MHz measurements. Experimental results for several different magnetic materials demonstrate that the proposed system has great accuracy and is able to collect more than 1000 data points within 20 seconds, providing a very fast and robust solution for high-frequency magnetic characterization.
comment: 8 pages, 6 figures, conference
Realization of Precise Perforating Using Dynamic Threshold and Physical Plausibility Algorithm for Self-Locating Perforating in Oil and Gas Wells
Accurate depth measurement is critical for targeting designated perforation intervals to maximize hydrocarbon recovery. While next-generation automated wireless perforating techniques reduce reliance on costly surface infrastructure and personnel, they lack the continuous depth correlation provided by conventional wireline cables. Consequently, correlating real-time casing collar locator (CCL) signals with a pre-recorded casing tally is essential for automatic depth determination. However, implementing this measurement remains challenging: downhole instruments must process CCL signals in real-time to identify collar signatures from complex interference, a task severely restricted by the limited computational resources and power budget of high-temperature downhole electronics. To address these constraints, this work proposes the Dynamic Threshold and Physical Plausibility Depth Measurement and Perforation Control (DTPPMP) system. This integrated solution enables in situ depth calibration by correlating CCL signals with the casing tally using lightweight algorithms for dynamic-threshold-based collar recognition and physical plausibility verification. Field tests demonstrate a collar recognition F1-score of 98.6% at a throughput of 1000 Sa/s. Notably, the algorithm requires only 1.5 μs per sample, confirming its computational efficiency and suitability for deployment on resource-constrained, high-temperature downhole platforms.
comment: This work has been submitted to the IEEE for possible publication
Safety-Critical Adaptive Impedance Control via Nonsmooth Control Barrier Functions under State and Input Constraints
Safe physical interaction is critical for deploying robotic manipulators in human-robot interaction and contact-rich tasks, where uncertainty, external forces, and actuator limitations can compromise both performance and safety. We propose an online adaptive impedance control framework that enforces joint-state safety while achieving compliant interaction under uncertain dynamics. The approach combines a quadratic-program-based safety filter with a novel composed position-velocity non-smooth control barrier function (NCBF), enabling joint position and velocity constraints to be enforced through a unified relative-degree-one barrier. Unknown dynamics are compensated online using an interval type-2 fuzzy logic system, while actuator torque limits are handled through soft constraints with exact penalty recovery of feasible solutions. A disturbance-observer-enhanced safety mechanism improves robustness against modelling errors and external interaction forces. Using composite Lyapunov analysis, we prove forward invariance of the safe set and the uniform ultimately boundedness of the impedance-tracking error. Simulations on a 7-DOF manipulator with severe parametric uncertainty and external interaction wrenches demonstrate safe constraint satisfaction and robust impedance tracking.
comment: 12 pages, 3 figures
Actor-Identifier-Critic Reinforcement Learning for Adaptive Model-Free Optimal Control of Nonlinear Systems with Stochastic Packet Dropouts
Packet dropouts in control systems poses a critical challenge, as it can significantly compromise system performance and stability. In these conditions, classical controllers often struggle to deliver effective control, as they rely on accurate system models, which may not always be available. This paper proposes a novel Actor-Identifier-Critic~(AIC) controller to address model-free tracking control of nonlinear systems in the presence of packet dropouts in both the controller-to-actuator and sensor-to-controller channels. Using an identifier to learn the system dynamics, the proposed controller is able to handle packet dropouts in the communication link and facilitate gradient propagation from the critic to the actor within a model-free control framework. The performance of the proposed method is demonstrated on two nonlinear SIMO and MIMO systems and a case study on power system stability subject to stochastic packet dropouts.
Unraveling tensor structures in correct-by-design controller synthesis SC
Formal safety guarantees on the synthesis of controllers for stochastic systems can be obtained using correct-by-design approaches. These approaches often use abstractions as finite-state Markov Decision Processes. As the state space of these MDPs grows, the curse of dimensionality makes the computational and memory cost of the probabilistic guarantees, quantified with dynamic programming, scale exponentially. In this work, we leverage decoupled dynamics and unravel, via dynamic programming operations, a tree structure in the Canonical Polyadic Decomposition (CPD) of the value functions. For discrete-time stochastic systems with syntactically co-safe linear temporal logic (scLTL) specifications, we provide provable probabilistic safety guarantees and significantly alleviate the computational burden. We provide an initial validation of the theoretical results on several typical case studies and showcase that the uncovered tree structure enables efficient reductions in the computational burden.
comment: code: https://github.com/shaesaert/SySCoRe/releases/tag/v0.tens
Quasi-Static Control of Discrete Cosserat Rod
In this paper, we design feedback control laws for soft robots modelled using the Cosserat rod, which is spatially discretised using the Piecewise Constant Strain (PCS) approach. The PCS approach transforms the nonlinear PDEs describing the Cosserat rod to a system of nonlinear ODEs. This simplification results in a model describing soft robots which is similar to the serial rigid-link manipulators. We design feedback control laws for the quasi-static PCS model by using the external wrenches as control input. The control laws are designed based on state-feedback linearisation in strain and task spaces. An extensive set of numerical results demonstrates the performance of the control laws for end-effector trajectory tracking and shape control of soft robots.
comment: Submitted to 17th APCA International Conference on Automatic Control and Soft Computing (CONTROLO 2026)
Simulation-based planning of Motion Sequences for Automated Procedure Optimization in Multi-Robot Assembly Cells
Reconfigurable multi-robot cells offer a promising approach to meet fluctuating assembly demands. However, the recurrent planning of their configurations introduces new challenges, particularly in generating optimized, coordinated multi-robot motion sequences that minimize the assembly duration. This work presents a simulation-based method for generating such optimized sequences. The approach separates assembly steps into task-related core operations and connecting traverse operations. While core operations are constrained and predetermined, traverse operations offer substantial optimization potential. Scheduling the core operations is formulated as an optimization problem, requiring feasible traverse operations to be integrated using a decomposition-based motion planning strategy. Several solution techniques are explored, including a sampling heuristic, tree-based search and gradient-free optimization. For motion planning, a decomposition method is proposed that identifies specific areas in the schedule, which can be solved independently with modified centralized path planning algorithms. The proposed method generates efficient and collision-free multi-robot assembly procedures that outperform a baseline relying on decentralized, robot-individual motion planning. Its effectiveness is demonstrated through simulation experiments.
comment: Accepted for publication at IEEE CASE 2026
Frequency Quality Metrics based on Second-Order Derivative and Autocorrelation
This industry-oriented paper originates from the observation that current frequency quality metrics utilized by transmission system operators (TSOs) fail to fully capture the dynamic behavior of the grid frequency. Motivated by this gap, the paper proposes novel frequency quality metrics based on second-order dynamics and stochastic autocorrelation. Using real-world data with 0.1 s and 1 s resolution from the Irish, Great Britain and Nordic systems and running dynamic stochastic simulations, the paper shows that the proposed metrics bring new and counterintuitive insights in terms of how good or poor the frequency quality of power grids is beyond current well-known metrics. In particular, the paper shows that a power system may show good frequency quality using standard metrics and poor frequency quality using the proposed metrics. Overall, the paper contributes to improve the understanding of frequency quality.
Practical Insights on Grasp Strategies for Mobile Manipulation in the Wild IROS 2025
Mobile manipulation robots are continuously advancing, with their grasping capabilities rapidly progressing. However, there are still significant gaps preventing state-of-the-art mobile manipulators from widespread real-world deployments, including their ability to reliably grasp items in unstructured environments. To help bridge this gap, we developed SHOPPER, a mobile manipulation robot platform designed to push the boundaries of reliable and generalizable grasp strategies. We develop these grasp strategies and deploy them in a real-world grocery store -- an exceptionally challenging setting chosen for its vast diversity of manipulable items, fixtures, and layouts. In this work, we present our detailed approach to designing general grasp strategies towards picking any item in a real grocery store. Additionally, we provide an in-depth analysis of our latest real-world field test, discussing key findings related to fundamental failure modes over hundreds of distinct pick attempts. Through our detailed analysis, we aim to offer valuable practical insights and identify key grasping challenges, which can guide the robotics community towards pressing open problems in the field.
comment: 8 pages, 8 figures, submitted to IROS 2025
Singularity-free dynamical invariants-based quantum control
State preparation is a cornerstone of quantum technologies, underpinning applications in computation, communication, and sensing. Its importance becomes even more pronounced in non-Markovian open quantum systems, where environmental memory and model uncertainties pose significant challenges to achieving high-fidelity control. Invariant-based inverse engineering provides a principled framework for synthesizing analytic control fields, yet existing parameterizations often lead to experimentally infeasible, singular pulses and are limited to simplified noise models such as those of Lindblad form. Here, we introduce a generalized invariant-based protocol for finite-dimensional state preparation under arbitrary noise conditions. We transform the finite-dimensional control problem into the equivalent problem for a single-qubit, by restricting the dynamics to a designed SU(2) subspace. The control protocol then proceeds in two-stages: first, we construct a family of bounded pulses that achieve perfect state preparation in a closed system; second, we identify the optimal member of this family that minimizes the effect of noise. The framework accommodates both (i) characterized noise, enabling noise-aware control synthesis, and (ii) uncharacterized noise, where a noise-agnostic variant preserves robustness without requiring a master-equation description. Numerical simulations demonstrate high-fidelity state preparation across diverse targets while producing smooth, hardware-feasible control fields. This singularity-free framework extends invariant-based control to realistic open-system regimes, providing a versatile route toward robust quantum state engineering on NISQ hardware and other platforms exhibiting non-Markovian dynamics.
A Vertical Look at UAV Connectivity in the Wild: Cellular vs. Starlink, 3D Characterization, and Performance Prediction
In this paper, we present an open-source measurement platform designed to characterize the performance of commercial cellular (Verizon, a major US provider) and LEO satellite (Starlink) networks through real-world flight tests in rural environments. We implement a comprehensive multi-layer measurement approach spanning physical layer signal metrics, multi-cell network topology, and end-to-end (E2E) application performance. Through an extensive flight campaign with more than $10$ flight tests, $4.5$+ hours of flight time resulting in more than $18$K samples, we present the first detailed, open-source dataset analyzing dual cellular and Starlink performance for low-altitude UAV operations. Our cellular-Starlink comparative results, which are collected \emph{simultaneously at the same time and location}, demonstrate significant performance differences between the two technologies: the LEO satellite link achieves superior latency performance with $95\%$ of Round-Trip Time (RTT) measurements below $50$ ms compared to $80\%$ under $150$ ms for cellular, and exceptional downlink capacity with $95\%$ exceeding $25$ Mbps versus only $5$ Mbps for cellular. Our analysis on cellular network performance demonstrates that while higher altitudes (e.g., $330+$ m above the sea level) improve signal power by $15-20$ dB via line-of-sight (LOS) propagation, it causes a $3-4$ $\times$ increase in handover rates, which is due to excessive multi-cell visibility rather than signal degradation. Furthermore, we observe asymmetric impacts on the RTT performance due to handovers such that $53.5$\% of handovers improve RTT, but worst-case degradation ($275$ ms) is $2$ $\times$ larger than best-case improvement ($137$ ms).
Space-Control: Process-Level Isolation for Sharing CXL-based Disaggregated Memory
Memory disaggregation via CXL enables multi-host resource sharing. However, existing CXL sharing mechanisms enforce coarse-grained, host-level permissions only, leaving isolation to the operating system. Today, virtual memory enables process-level isolation on a host and CXL enables host-level isolation. This creates a critical security gap: the absence of process-level memory isolation in shared disaggregated memory. We present Space-Control, an architectural abstraction that introduces a cross-host identity primitive to enforce confidentiality and integrity. We decouple authorization from the untrusted OS using a hardware-rooted validation engine (SPACE) to establish immutable process identity and a Permission Checker at the memory egress point for fine-grained permission validation. Our design supports 127 concurrent processes across 255 hosts with only 1.56% storage overhead. Cycle-level evaluation using gem5 + SST shows that Space-Control incurs a minimal 3.3% performance penalty with a modest 16 KiB cache, providing a practical and scalable foundation for secure, process-level memory disaggregation.
Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models CVPR
Future spacecraft operations require autonomy that can interpret high-level mission intent while preserving safety. However, existing trajectory optimization still relies heavily on expert-crafted formulations and does not support intent-conditioned decision-making. This paper proposes an intent-aligned spacecraft guidance framework that links high-level reasoning and safe trajectory optimization through explicit intermediate abstractions, based on behavior sequences and waypoint constraints. A foundation model first predicts an intent-aligned behavior plan, a waypoint generation model then converts it into waypoint constraints, and the safe trajectory is computed via optimization. This decomposition enables scalable supervision without sacrificing safety. Numerical experiments in close-proximity operation scenarios demonstrate that the proposed pipeline achieves over 90\% SCP convergence and yields a $1.5\times$ higher rate of generating trajectories that satisfy the top intent-prioritized performance criteria than heuristic decision-making. These results support the use of intermediate behavior abstraction as a practical interface between foundation-model reasoning and safety-critical onboard spacecraft autonomy.
comment: Accepted for Computer Vision and Pattern Recognition Conference (CVPR) 2026, AI4Space Workshop (4-page Short paper). 9 pages, 3 figures (including supplementary materials)
Sensor Fusion for Track Geometry Monitoring: Integrating On-Board Condition Monitoring and Degradation Models via Kalman Filtering
Track geometry monitoring is essential for maintaining the safety and efficiency of railway operations. While Track Recording Cars (TRCs) provide accurate measurements of track geometry indicators, their limited availability and high operational costs restrict frequent monitoring across large rail networks. Recent advancements in on-board sensor systems installed on in-service trains offer a cost-effective alternative by enabling high-frequency, albeit less accurate, data collection. This study proposes a method to enhance the reliability of track geometry predictions by integrating low-accuracy sensor vibration signals with degradation models through a Kalman filter framework. An experimental campaign using a low-cost sensor system mounted on a TRC evaluates the proposed approach. The results demonstrate that incorporating frequent sensor data significantly reduces prediction uncertainty, even when the data is noisy. The study also investigates how the frequency of data recording influences the size of the credible prediction interval, providing guidance on the optimal deployment of on-board sensors for effective track monitoring and maintenance planning.
TAGA: A Tangent-Based Reactive Approach for Socially Compliant Robot Navigation Around Human Groups
Robots navigating human-populated environments must avoid collisions while respecting the social structure of crowds, particularly the implicit boundaries of social groups. Most navigation approaches model humans as independent individuals,causing socially disruptive behavior even when collision-free. This paper presents TAGA (Tangent Action for Group Avoidance), detected group boundaries via tangent-path maneuvers without modifying the underlying navigation policy. A hierarchical safety controller coordinates group-level avoidance with individual collision prevention. We propose the Group Crossing Rate (GCR), a continuous metric measuring the fraction of timesteps the robot spends inside any group convex hull, providing finer-grained social compliance assessment than terminal metrics alone. We introduce a realistic crowd simulation benchmark with five empirically grounded phases: individual speed heterogeneity, group speed coupling, F-formation static groups, leader-follower dynamics, and convex-hull boundaries, evaluated under both ORCA and Social Force pedestrian dynamics. Experiments across ORCA, Social Force, DS-RNN, and Intention-RL reveal a reactive-learning asymmetry: TAGA provides the largest gains for classical reactive baselines (up to +8pp success rate, GCR halved) with near-zero cost for learned policies. These findings offer actionable guidance for when modular group-awareness adds value versus when end-to-end group-aware training is preferable.
comment: 8 pages, 3 figures, 3 tables. Submitted to IEEE Robotics and Automation Letters (RA-L)