Contributors Forks Stargazers Issues

Updated on 2026.04.06

Usage instructions: here

🔥 HuggingFace Hot Papers

Publish Date Title & Abstract Authors Links
2026-03-28 LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model HF-Hot 🔥 HF#1
Learning human-object manipulation presents significant challenges due to its fine-grained and contact-rich nature of the motions involved. Traditional physics-based animation requires extensive modeling and manual setup, and more importantly, it neither generalizes well across diverse object morphologies nor scales effectively to real-world environment. To address these limitations, we introduce…
Yue Wang Team ArXiv
2026-04-01 Signals: Trajectory Sampling and Triage for Agentic Interactions HF-Hot 🔥 HF#2
Agentic applications based on large language models increasingly rely on multi-step interaction loops involving planning, action execution, and environment feedback. While such systems are now deployed at scale, improving them post-deployment remains challenging. Agent trajectories are voluminous and non-deterministic, and reviewing each one, whether through human review or auxiliary LLMs, is…
Salman Paracha Team ArXiv
2026-03-30 An Empirical Recipe for Universal Phone Recognition HF-Hot 🔥 HF#3
Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We…
David R. Mortensen Team ArXiv
2026-04-01 Forecasting Supply Chain Disruptions with Foresight Learning HF-Hot 🔥 HF#4
Anticipating supply chain disruptions before they materialize is a core challenge for firms and policymakers alike. A key difficulty is learning to reason reliably about infrequent, high-impact events from noisy and unstructured inputs - a setting where general-purpose models struggle without task-specific adaptation. We introduce an end-to-end framework that trains LLMs to produce calibrated…
Kris Skotheim Team ArXiv
2026-04-02 CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery HF-Hot 🔥 HF#5
Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL…
Paul Pu Liang Team ArXiv
2026-03-31 Video Models Reason Early: Exploiting Plan Commitment for Maze Solving HF-Hot 🔥 HF#6
Video diffusion models exhibit emergent reasoning capabilities like solving mazes and puzzles, yet little is understood about how they reason during generation. We take a first step towards understanding this and study the internal planning dynamics of video models using 2D maze solving as a controlled testbed. Our investigations reveal two findings. Our first finding is early plan commitment:…
Olga Russakovsky Team ArXiv
2026-04-01 Therefore I am. I Think HF-Hot 🔥 HF#7
We consider the question: when a large language reasoning model makes a choice, did it think first and then decide to, or decide first and then think? In this paper, we present evidence that detectable, early-encoded decisions shape chain-of-thought in reasoning models. Specifically, we show that a simple linear probe successfully decodes tool-calling decisions from pre-generation activations…
Rajagopal Venkatesaramani Team ArXiv
2026-03-03 MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines HF-Hot 🔥 HF#8
Video world models have shown immense promise for interactive simulation and entertainment, but current systems still struggle with two important aspects of interactivity: user control over the environment for reproducible, editable experiences, and shared inference where players hold influence over a common world. To address these limitations, we introduce an explicit external memory into the…
Nataniel Ruiz Team ArXiv
2026-04-02 NearID: Identity Representation Learning via Near-identity Distractors HF-Hot 🔥 HF#9
When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the…
Peter Wonka Team ArXiv
2026-03-27 Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models HF-Hot 🔥 HF#10
While Late Interaction models exhibit strong retrieval performance, many of their underlying dynamics remain understudied, potentially hiding performance bottlenecks. In this work, we focus on two topics in Late Interaction retrieval: a length bias that arises when using multi-vector scoring, and the similarity distribution beyond the best scores pooled by the MaxSim operator. We analyze these…
Quentin Macé Team ArXiv
2026-04-02 ActionParty: Multi-Subject Action Binding in Generative Video Games HF-Hot 🔥 HF#11
Recent advances in video diffusion have enabled the development of “world models” capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific…
Aliaksandr Siarohin Team ArXiv
2026-03-30 AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation HF-Hot 🔥 HF#12
Although image generation has boosted various applications via its rapid evolution, whether the state-of-the-art models are able to produce ready-to-use academic illustrations for papers is still largely unexplored. Directly comparing or evaluating the illustration with VLM is native but requires oracle multi-modal understanding ability, which is unreliable for long and complex texts and…
Xihui Liu Team ArXiv
2026-03-27 Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents HF-Hot 🔥 HF#13
As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the…
Sebastian Schuster Team ArXiv
2026-03-31 ASI-Evolve: AI Accelerates AI HF-Hot 🔥 HF#14
Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well-scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long-horizon, and weakly supervised research loops that drive real AI progress. We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a…
Pengfei Liu Team ArXiv
2026-04-01 Efficient and Principled Scientific Discovery through Bayesian Optimization: A Tutorial HF-Hot 🔥 HF#15
Traditional scientific discovery relies on an iterative hypothesise-experiment-refine cycle that has driven progress for centuries, but its intuitive, ad-hoc implementation often wastes resources, yields inefficient designs, and misses critical insights. This tutorial presents Bayesian Optimisation (BO), a principled probability-driven framework that formalises and automates this core scientific…
Jun Wang Team ArXiv
2026-03-30 MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios HF-Hot 🔥 HF#16
We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse…
Yuliang Liu Team ArXiv
2026-04-01 Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning HF-Hot 🔥 HF#17
We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA…
Mohammad R. Abu Ayyash ArXiv
2026-04-02 Steerable Visual Representations HF-Hot 🔥 HF#18
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be…
Yuki M. Asano Team ArXiv
2026-04-02 Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models Manipulation VLA HF-Hot 🔥 HF#19
Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast,…
Zhaoxia Yin Team ArXiv
2026-03-25 Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning HF-Hot 🔥 HF#20
Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, with limited capacity…
Lorenzo Natale Team ArXiv
2026-04-01 AutoMIA: Improved Baselines for Membership Inference Attack via Agentic Self-Exploration HF-Hot 🔥 HF#21
Membership Inference Attacks (MIAs) serve as a fundamental auditing tool for evaluating training data leakage in machine learning models. However, existing methodologies predominantly rely on static, handcrafted heuristics that lack adaptability, often leading to suboptimal performance when transferred across different large models. In this work, we propose AutoMIA, an agentic framework that…
Xinchao Wang Team ArXiv
2026-04-02 Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation HF-Hot 🔥 HF#22
Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift…
Xiaoguang Han Team ArXiv
2026-04-01 Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time HF-Hot 🔥 HF#23
The rise of large language models for code has reshaped software development. Autonomous coding agents, able to create branches, open pull requests, and perform code reviews, now actively contribute to real-world projects. Their growing role offers a unique and timely opportunity to investigate AI-driven contributions and their effects on code quality, team dynamics, and software maintainability….
Maliheh Izadi Team ArXiv / Web
2026-04-02 VOID: Video Object and Interaction Deletion HF-Hot 🔥 HF#24
Existing video object removal methods excel at inpainting content “behind” the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to…
Ta-Ying Cheng Team ArXiv
2026-04-02 DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data HF-Hot 🔥 HF#25
Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is…
Sunghyun Cho Team ArXiv
2026-04-02 T5Gemma-TTS Technical Report HF-Hot 🔥 HF#26
Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing…
Kiyoshi Kurihara Team ArXiv
2026-04-01 Executing as You Generate: Hiding Execution Latency in LLM Code Generation HF-Hot 🔥 HF#27
Current LLM-based coding agents follow a serial execution paradigm: the model first generates the complete code, then invokes an interpreter to execute it. This sequential workflow leaves the executor idle during generation and the generator idle during execution, resulting in unnecessary end-to-end latency. We observe that, unlike human developers, LLMs produce code tokens sequentially without…
David Lo Team ArXiv
2026-04-02 Automatic Image-Level Morphological Trait Annotation for Organismal Images HF-Hot 🔥 HF#28
Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we…
Yu Su Team ArXiv
2026-04-01 UniRecGen: Unifying Multi-View 3D Reconstruction and Generation HF-Hot 🔥 HF#29
Sparse-view 3D modeling represents a fundamental tension between reconstruction fidelity and generative plausibility. While feed-forward reconstruction excels in efficiency and input alignment, it often lacks the global priors needed for structural completeness. Conversely, diffusion-based generation provides rich geometric details but struggles with multi-view consistency. We present UniRecGen,…
Wenping Wang Team ArXiv
2026-04-01 Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models HF-Hot 🔥 HF#30
Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities…
Mor Geva Team ArXiv
2026-04-01 LinguDistill: Recovering Linguistic Ability in Vision- Language Models via Selective Cross-Modal Distillation HF-Hot 🔥 HF#31
Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that…
Yova Kementchedjhieva Team ArXiv
2026-04-02 Woosh: A Sound Effects Foundation Model HF-Hot 🔥 HF#32
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI’s publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality…
Yuki Mitsufuji Team ArXiv
2026-04-02 Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning HF-Hot 🔥 HF#33
Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty and sample…
Alexandre Drouin Team ArXiv
2026-03-29 Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers HF-Hot 🔥 HF#34
Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore in this paper controllable diffusion models built…
Xinchao Wang Team ArXiv
2026-04-02 LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model HF-Hot 🔥 HF#35
Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual…
Zhijie Deng Team ArXiv
2026-04-02 UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving VLA HF-Hot 🔥 HF#36
Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises:…
Xinggang Wang Team ArXiv
2026-04-01 Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory HF-Hot 🔥 HF#37
AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory requires navigating a vast design space spanning architecture, retrieval strategies, prompt engineering, and data pipelines; this space is too large and interconnected for manual exploration or…
Huaxiu Yao Team ArXiv
2026-03-27 DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models HF-Hot 🔥 HF#38
Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces,…
Wentao Zhang Team ArXiv
2026-04-01 EgoSim: Egocentric World Simulator for Embodied Interaction Generation Manipulation HF-Hot 🔥 HF#39
We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage…
Xudong Xu Team ArXiv
2026-04-02 VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification HF-Hot 🔥 HF#40
Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting…
Haodong Duan Team ArXiv
2026-04-02 Generative World Renderer HF-Hot 🔥 HF#41
Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of…
Kaipeng Zhang Team ArXiv
2026-04-02 FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition HF-Hot 🔥 HF#42
Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can…
Kazuhiko Sumi Team ArXiv
2026-04-02 GPA: Learning GUI Process Automation from Demonstrations HF-Hot 🔥 HF#43
GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA), which enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization…
Junnan Li Team ArXiv
2026-04-02 The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook HF-Hot 🔥 HF#44
Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of…
Shuicheng Yan Team ArXiv
2026-04-02 SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization HF-Hot 🔥 HF#45
Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires…
Yongliang Shen Team ArXiv
2026-04-01 PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding HF-Hot 🔥 HF#46
Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful – across document and GUI benchmarks, only 22–71\% of…
Haonan Lu Team ArXiv
2026-04-01 AgentWatcher: A Rule-based Prompt Injection Monitor HF-Hot 🔥 HF#47
Large language models (LLMs) and their applications, such as agents, are highly vulnerable to prompt injection attacks. State-of-the-art prompt injection detection methods have the following limitations: (1) their effectiveness degrades significantly as context length increases, and (2) they lack explicit rules that define what constitutes prompt injection, causing detection decisions to be…
Jinyuan Jia Team ArXiv
2026-03-26 Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy HF-Hot 🔥 HF#48
As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude~4.5~Sonnet, GPT-5, and Llama-3.1-70B…
Aman Mehta ArXiv
2026-04-01 When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation HF-Hot 🔥 HF#49
As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid-task execution is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only…
Philip S. Yu Team ArXiv
2026-04-01 S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models HF-Hot 🔥 HF#50
Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1…
Jack Young ArXiv

(<a href=#updated-on-20260406>back to top</a>)

Dexterous

Publish Date Title & Abstract Authors Links
2026-04-01 How to Train your Tactile Model: Tactile Perception with Multi-fingered Robot Hands Dexterous Manipulation Tactile
Rapid deployment of new tactile sensors is essential for scalable robotic manipulation, especially in multi-fingered hands equipped with vision-based tactile sensors. However, current methods for inferring contact properties rely heavily on convolutional neural networks (CNNs), which, while effective on known sensors, require large, sensor-specific datasets. Furthermore, they require retraining…
Efi Psomopoulou Team ArXiv
2026-03-31 Kilohertz-Safe: A Scalable Framework for Constrained Dexterous Retargeting Dexterous
Dexterous hand teleoperation requires motion re-targeting methods that simultaneously achieve high-frequency real-time performance and enforcement of heterogeneous kinematic and safety constraints. Existing nonlinear optimization-based approaches often incur prohibitive computational cost, limiting their applicability to kilohertz-level control, while learning-based methods typically lack formal…
Zhen Kan Team ArXiv
2026-03-30 Feel Robot Feels: Tactile Feedback Array Glove for Dexterous Manipulation Manipulation Dexterous
Teleoperation is a key approach for collecting high-quality, physically consistent demonstrations for robotic manipulation. However, teleoperation for dexterous manipulation remains constrained by: (i) inaccurate hand-robot motion mapping, which limits teleoperated dexterity, and (ii) limited tactile feedback that forces vision-dominated interaction and hinders perception of contact geometry and…
Jiangmiao Pang Team ArXiv
2026-03-30 Tele-Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching Manipulation Dexterous
Teleoperation is a key paradigm for transferring human dexterity to robots, yet most prior work targets objects that are initially static, such as grasping or manipulation. Dynamic object catch, where objects move before contact, remains underexplored. Pure teleoperation in this task often fails due to timing, pose, and force errors, highlighting the need for shared autonomy that combines human…
Kaizhu Huang Team ArXiv
2026-03-30 FocusVLA: Focused Visual Utilization for Vision-Language-Action Models VLA Dexterous
Vision-Language-Action (VLA) models improve action generation by conditioning policies on rich vision-language information. However, current auto-regressive policies are constrained by three bottlenecks: (1) architectural bias drives models to overlook visual details, (2) an excessive number of visual tokens makes attention difficult to focus on the correct regions, and (3) task-irrelevant visual…
Jia Wan Team ArXiv
2026-03-24 Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models Dexterous VLA Sim2Real
Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the…
Guiliang Liu Team ArXiv
2026-03-23 Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration Dexterous
The process of discovery requires active exploration – the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this…
James Cohan Team ArXiv
2026-03-23 UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos Dexterous VLA
Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-language-action (VLA) policy and a practical human-data capture setup for universal dexterous hand…
Huazhe Xu Team ArXiv
2026-03-23 DexDrummer: In-Hand, Contact-Rich, and Long-Horizon Dexterous Robot Drumming Dexterous Sim2Real
Performing in-hand, contact-rich, and long-horizon dexterous manipulation remains an unsolved challenge in robotics. Prior hand dexterity works have considered each of these three challenges in isolation, yet do not combine these skills into a single, complex task. To further test the capabilities of dexterity, we propose drumming as a testbed for dexterous manipulation. Drumming naturally…
Dorsa Sadigh Team ArXiv
2026-03-23 ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model Dexterous
Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision–language…
Yun Fu Team ArXiv
2026-03-22 Geometrically Plausible Object Pose Refinement using Differentiable Simulation Dexterous Tactile
State-of-the-art object pose estimation methods are prone to generating geometrically infeasible pose hypotheses. This problem is prevalent in dexterous manipulation, where estimated poses often intersect with the robotic hand or are not lying on a support surface. We propose a multi-modal pose refinement approach that combines differentiable physics simulation, differentiable rendering and…
Akansel Cosgun Team ArXiv
2026-03-22 Affordance-Guided Enveloping Grasp Demonstration Toward Non-destructive Disassembly of Pinch-Infeasible Mating Parts Dexterous
Robotic disassembly of complex mating components often renders pinch grasping infeasible, necessitating multi-fingered enveloping grasps. However, visual occlusions and geometric constraints complicate teaching appropriate grasp motions when relying solely on 2D camera feeds. To address this, we propose an affordance-guided teleoperation method that pre-generates enveloping grasp candidates via…
Kensuke Harada Team ArXiv
2026-03-18 DexViTac: Collecting Human Visuo-Tactile-Kinematic Demonstrations for Contact-Rich Dexterous Manipulation Dexterous Tactile
Large-scale, high-quality multimodal demonstrations are essential for robot learning of contact-rich dexterous manipulation. While human-centric data collection systems lower the barrier to scaling, they struggle to capture the tactile information during physical interactions. Motivated by this, we present DexViTac, a portable, human-centric data collection system tailored for contact-rich…
Xiaotian Ding Team ArXiv
2026-03-17 TeleDex: Accessible Dexterous Teleoperation Dexterous
Despite increasing dataset scale and model capacity, robot manipulation policies still struggle to generalize beyond their training distributions. As a result, deploying state-of-the-art policies in new environments, tasks, or robot embodiments often requires collecting additional demonstrations. Enabling this in real-world deployment settings requires tools that allow users to collect…
Yuchen Cui Team ArXiv
2026-03-17 DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping Dexterous
To meet the demands of increasingly diverse dexterous hand hardware, it is crucial to develop a policy that enables zero-shot cross-embodiment grasping without redundant re-learning. Cross-embodiment alignment is challenging due to heterogeneous hand kinematics and physical constraints. Existing approaches typically predict intermediate motion targets and retarget them to each embodiment, which…
Ancong Wu Team ArXiv
2026-03-17 Beyond Cybathlon: On-demand Quadrupedal Assistance for People with Limited Mobility Dexterous
Background: Assistance robots have the potential to increase the independence of people who need daily care due to limited mobility or being wheelchair-bound. Current solutions of attaching robotic arms to motorized wheelchairs offer limited additional mobility at the cost of increased size and reduced wheelchair maneuverability. Methods: We present an on-demand quadrupedal assistance robot…
Marco Hutter Team ArXiv
2026-03-17 Dexterous grasp data augmentation based on grasp synthesis with fingertip workspace cloud and contact-aware sampling Dexterous
Robotic grasping is a fundamental yet crucial component of robotic applications, as effective grasping often serves as the starting point for various tasks. With the rapid advancement of neural networks, data-driven approaches for robotic grasping have become mainstream. However, efficiently generating grasp datasets for training remains a bottleneck. This is compounded by the diverse structures…
Kei Okada Team ArXiv / Web
2026-03-17 Fast and Reliable Gradients for Deformables Across Frictional Contact Regimes Dexterous Sim2Real
Differentiable simulation establishes the mathematical foundation for solving challenging inverse problems in computer graphics and robotics, such as physical system identification and inverse dynamics control. However, rigor in frictional contact remains the “elephant in the room.” Current frameworks often avoid contact singularities via non-Markovian position approximations or heuristic…
Fan Shi Team ArXiv
2026-03-17 EFF-Grasp: Energy-Field Flow Matching for Physics-Aware Dexterous Grasp Generation Dexterous
Denoising generative models have recently become the dominant paradigm for dexterous grasp generation, owing to their ability to model complex grasp distributions from large-scale data. However, existing diffusion-based methods typically formulate generation as a stochastic differential equation (SDE), which often requires many sequential denoising steps and introduces trajectory instability that…
Haoliang Sun Team ArXiv
2026-03-16 End-to-End Dexterous Grasp Learning from Single-View Point Clouds via a Multi-Object Scene Dataset Dexterous
Dexterous grasping in multi-object scene constitutes a fundamental challenge in robotic manipulation. Current mainstream grasping datasets predominantly focus on single-object scenarios and predefined grasp configurations, often neglecting environmental interference and the modeling of dexterous pre-grasp gesture, thereby limiting their generalizability in real-world applications. To address…
Fenglei Ni Team ArXiv
2026-03-16 Emergent Dexterity via Diverse Resets and Large-Scale Reinforcement Learning Dexterous Sim2Real
Reinforcement learning in massively parallel physics simulations has driven major progress in sim-to-real robot learning. However, current approaches remain brittle and task-specific, relying on extensive per-task engineering to design rewards, curricula, and demonstrations. Even with this engineering, they often fail on long-horizon, contact-rich manipulation tasks and do not meaningfully scale…
Abhishek Gupta Team ArXiv
2026-03-15 One-Policy-Fits-All: Geometry-Aware Action Latents for Cross-Embodiment Manipulation Dexterous
Cross-embodiment manipulation is crucial for enhancing the scalability of robot manipulation and reducing the high cost of data collection. However, the significant differences between embodiments, such as variations in action spaces and structural disparities, pose challenges for joint training across multiple sources of data. To address this, we propose One-Policy-Fits-All (OPFA), a framework…
Jiangmiao Pang Team ArXiv
2026-03-15 Context-Aware Adaptive Shared Control for Magnetically-Driven Bimanual Dexterous Micromanipulation Dexterous
Magnetically actuated robots provide a promising untethered platform for navigation in confined environments, enabling biological studies and targeted micro-delivery. However, dexterous manipulation in complex structures remains challenging. While single-arm magnetic actuation suffices for simple transport, steering through tortuous or bifurcating channels demands coordinated control of multiple…
Dandan Zhang Team ArXiv
2026-03-14 TransDex: Pre-training Visuo-Tactile Policy with Point Cloud Reconstruction for Dexterous Manipulation of Transparent Objects Dexterous
Dexterous manipulation enables complex tasks but suffers from self-occlusion, severe depth noise, and depth information loss when manipulating transparent objects. To solve this problem, this paper proposes TransDex, a 3D visuo-tactile fusion motor policy based on point cloud reconstruction pre-training. Specifically, we first propose a self-supervised point cloud reconstruction pre-training…
Weiwei Shang Team ArXiv
2026-03-14 LDHP: Library-Driven Hierarchical Planning for Non-prehensile Dexterous Manipulation Dexterous
Non-prehensile manipulation is essential for handling thin, large, or otherwise ungraspable objects in unstructured settings. Prior planning and search-based methods often rely on ad-hoc manual designs or generate physically unrealizable motions by ignoring critical gripper properties, while training-based approaches are data-intensive and struggle to generalize to novel, out-of-distribution…
Chao Zhao Team ArXiv
2026-03-14 GraspADMM: Improving Dexterous Grasp Synthesis via ADMM Optimization Dexterous
Synthesizing high-quality dexterous grasps is a fundamental challenge in robot manipulation, requiring adherence to diversity, kinematic feasibility (valid hand-object contact without penetration), and dynamic stability (secure multi-contact forces). The recent framework Dexonomy successfully ensures broad grasp diversity through dense sampling and improves kinematic feasibility via a…
Baoquan Chen Team ArXiv
2026-03-12 HumDex: Humanoid Dexterous Manipulation Made Easy Dexterous
This paper investigates humanoid whole-body dexterous manipulation, where the efficient collection of high-quality demonstration data remains a central bottleneck. Existing teleoperation systems often suffer from limited portability, occlusion, or insufficient precision, which hinders their applicability to complex whole-body tasks. To address these challenges, we introduce HumDex, a portable…
Yue Wang Team ArXiv
2026-03-12 HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies Dexterous
Mastering dexterous manipulation with multi-fingered hands has been a grand challenge in robotics for decades. Despite its potential, the difficulty of collecting high-quality data remains a primary bottleneck for high-precision tasks. While reinforcement learning and simulation-to-real-world transfer offer a promising alternative, the transferred policies often fail for tasks demanding…
Dorsa Sadigh Team ArXiv
2026-03-12 ComFree-Sim: A GPU-Parallelized Analytical Contact Physics Engine for Scalable Contact-Rich Robotics Simulation and Control Dexterous Sim2Real
Physics simulation for contact-rich robotics is often bottlenecked by contact resolution: mainstream engines enforce non-penetration and Coulomb friction via complementarity constraints or constrained optimization, requiring per-step iterative solves whose cost grows superlinearly with contact density. We present ComFree-Sim, a GPU-parallelized analytical contact physics engine built on…
Wanxin Jin Team ArXiv
2026-03-12 Concurrent Prehensile and Nonprehensile Manipulation: A Practical Approach to Multi-Stage Dexterous Tasks Dexterous
Dexterous hands enable concurrent prehensile and nonprehensile manipulation, such as holding one object while interacting with another, a capability essential for everyday tasks yet underexplored in robotics. Learning such long-horizon, contact-rich multi-stage behaviors is challenging because demonstrations are expensive to collect and end-to-end policies require substantial data to generalize…
Daniel Seita Team ArXiv
2026-03-11 Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation Dexterous
Deep Reinforcement learning (DRL) has achieved remarkable success in domains with well-defined reward structures, such as Atari games and locomotion. In contrast, dexterous manipulation lacks general-purpose reward formulations and typically depends on task-specific, handcrafted priors to guide hand-object interactions. We propose Contact Coverage-Guided Exploration (CCGE), a general exploration…
Lin Shao Team ArXiv
2026-03-11 AdaClearGrasp: Learning Adaptive Clearing for Zero-Shot Robust Dexterous Grasping in Densely Cluttered Environments Dexterous
In densely cluttered environments, physical interference, visual occlusions, and unstable contacts often cause direct dexterous grasping to fail, while aggressive singulation strategies may compromise safety. Enabling robots to adaptively decide whether to clear surrounding objects or directly grasp the target is therefore crucial for robust manipulation. We propose AdaClearGrasp, a closed-loop…
Yang Gao Team ArXiv
2026-03-11 FAR-Dex: Few-shot Data Augmentation and Adaptive Residual Policy Refinement for Dexterous Manipulation Dexterous
Achieving human-like dexterous manipulation through the collaboration of multi-fingered hands with robotic arms remains a longstanding challenge in robotics, primarily due to the scarcity of high-quality demonstrations and the complexity of high-dimensional action spaces. To address these challenges, we propose FAR-Dex, a hierarchical framework that integrates few-shot data augmentation with…
Zhengtao Zhang Team ArXiv
2026-03-10 Cross-Hand Latent Representation for Vision-Language-Action Models Dexterous
Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception–vision, sound, and language-guided intent–to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models…
Xueyan Zou Team ArXiv
2026-03-10 DexHiL: A Human-in-the-Loop Framework for Vision-Language-Action Model Post-Training in Dexterous Manipulation Dexterous
While Vision-Language-Action (VLA) models have demonstrated promising generalization capabilities in robotic manipulation, deploying them on specific and complex downstream tasks still demands effective post-training. In parallel, Human-in-the-Loop (HiL) learning has proven to be a powerful mechanism for refining robot policies. However, extending this paradigm to dexterous manipulation remains…
Wenzhao Lian Team ArXiv

(<a href=#updated-on-20260406>back to top</a>)

Manipulation

Publish Date Title & Abstract Authors Links
2026-04-02 Cross-Modal Visuo-Tactile Object Perception Manipulation Tactile
Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always…
Mohsen Kaboli Team ArXiv
2026-04-02 CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects Manipulation
When told to “cut the apple,” a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating…
Jianfei Yang Team ArXiv
2026-04-02 Integrated Identification of Collaborative Robots for Robot Assisted 3D Printing Processes Manipulation
In recent years, the integration of additive manufacturing (AM) and industrial robotics has opened new perspectives for the production of complex components, particularly in the automotive sector. Robot-assisted additive manufacturing processes overcome the dimensional and kinematic limitations of traditional Cartesian systems, enabling non-planar deposition and greater geometric flexibility….
Francesco Leali Team ArXiv
2026-04-02 Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning Manipulation VLA
Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a…
Dongbin Zhao Team ArXiv
2026-04-02 Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models Manipulation VLA HF-Hot 🔥 HF#19
Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast,…
Zhaoxia Yin Team ArXiv
2026-04-02 Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior Manipulation VLA
In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is, for each state, there exist a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and do not exploit the FAN property,…
Fei Wen Team ArXiv
2026-04-02 AnchorVLA: Anchored Diffusion for Efficient End-to-End Mobile Manipulation Manipulation VLA
A central challenge in mobile manipulation is preserving multiple plausible action models while remaining reactive during execution. A bottle in a cluttered scene can often be approached and grasped in multiple valid ways. Robust behavior depends on preserving this action diversity while remaining reactive as the scene evolves. Diffusion policies are appealing because they model multimodal action…
Yadan Luo Team ArXiv
2026-04-01 How to Train your Tactile Model: Tactile Perception with Multi-fingered Robot Hands Dexterous Manipulation Tactile
Rapid deployment of new tactile sensors is essential for scalable robotic manipulation, especially in multi-fingered hands equipped with vision-based tactile sensors. However, current methods for inferring contact properties rely heavily on convolutional neural networks (CNNs), which, while effective on known sensors, require large, sensor-specific datasets. Furthermore, they require retraining…
Efi Psomopoulou Team ArXiv
2026-04-01 Deep Reinforcement Learning for Robotic Manipulation under Distribution Shift with Bounded Extremum Seeking Manipulation
Reinforcement learning has shown strong performance in robotic manipulation, but learned policies often degrade in performance when test conditions differ from the training distribution. This limitation is especially important in contact-rich tasks such as pushing and pick-and-place, where changes in goals, contact conditions, or robot dynamics can drive the system out-of-distribution at…
Alexander Scheinker Team ArXiv
2026-04-01 EgoSim: Egocentric World Simulator for Embodied Interaction Generation Manipulation HF-Hot 🔥 HF#39
We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage…
Xudong Xu Team ArXiv
2026-04-01 SoftHand Model-W: A 3D-Printed, Anthropomorphic, Underactuated Robot Hand with Integrated Wrist and Carpal Tunnel Manipulation
This paper presents the SoftHand Model-W: a 3D-printed, underactuated, anthropomorphic robot hand based on the Pisa/IIT SoftHand, with an integrated antagonistic tendon mechanism and 2 degree-of-freedom tendon-driven wrist. These four degrees-of-acuation provide active flexion and extension to the five fingers, and active flexion/extension and radial/ulnar deviation of the palm through the wrist,…
Nathan F. Lepora Team ArXiv
2026-04-01 Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning Manipulation
The generalization ability of imitation learning policies for robotic manipulation is fundamentally constrained by the diversity of expert demonstrations, while collecting demonstrations across varied environments is costly and difficult in practice. In this paper, we propose a practical framework that exploits inherent scene diversity without additional human effort by scaling camera views…
Hao-Shu Fang Team ArXiv
2026-04-01 Functional Force-Aware Retargeting from Virtual Human Demos to Soft Robot Policies Manipulation
We introduce SoftAct, a framework for teaching soft robot hands to perform human-like manipulation skills by explicitly reasoning about contact forces. Leveraging immersive virtual reality, our system captures rich human demonstrations, including hand kinematics, object motion, dense contact patches, and detailed contact force information. Unlike conventional approaches that retarget human joint…
Harsha Prahlad Team ArXiv
2026-04-01 Learning When to See and When to Feel: Adaptive Vision-Torque Fusion for Contact-Aware Manipulation Manipulation
Vision-based policies have achieved a good performance in robotic manipulation due to the accessibility and richness of visual observations. However, purely visual sensing becomes insufficient in contact-rich and force-sensitive tasks where force/torque (F/T) signals provide critical information about contact dynamics, alignment, and interaction quality. Although various strategies have been…
Minghui Zheng Team ArXiv
2026-03-31 Hybrid Framework for Robotic Manipulation: Integrating Reinforcement Learning and Large Language Models Manipulation Sim2Real
This paper introduces a new hybrid framework that combines Reinforcement Learning (RL) and Large Language Models (LLMs) to improve robotic manipulation tasks. By utilizing RL for accurate low-level control and LLMs for high level task planning and understanding of natural language, the proposed framework effectively connects low-level execution with high-level reasoning in robotic systems. This…
Mohd Suhaib Team ArXiv
2026-03-31 Passive iFIR filters for data-driven velocity control in robotics Manipulation
We present a passive, data-driven velocity control method for nonlinear robotic manipulators that achieves better tracking performance than optimized PID with comparable design complexity. Using only three minutes of probing data, a VRFT-based design identifies passive iFIR controllers that (i) preserve closed-loop stability via passivity constraints and (ii) outperform a VRFT-tuned PID baseline…
Fulvio Forni Team ArXiv
2026-03-31 SafeDMPs: Integrating Formal Safety with DMPs for Adaptive HRI Manipulation
Robots operating in human-centric environments must be both robust to disturbances and provably safe from collisions. Achieving these properties simultaneously and efficiently remains a central challenge. While Dynamic Movement Primitives (DMPs) offer inherent stability and generalization from single demonstrations, they lack formal safety guarantees. Conversely, formal methods like Control…
Ravi Prakash Team ArXiv
2026-03-31 RAAP: Retrieval-Augmented Affordance Prediction with Cross-Image Action Alignment Manipulation
Understanding object affordances is essential for enabling robots to perform purposeful and fine-grained interactions in diverse and unstructured environments. However, existing approaches either rely on retrieval, which is fragile due to sparsity and coverage gaps, or on large-scale models, which frequently mislocalize contact points and mispredict post-contact actions when applied to unseen…
Xiu-Shen Wei Team ArXiv
2026-03-31 CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics Manipulation VLA
Robotic manipulation involves kinematic and semantic transitions that are inherently coupled via underlying actions. However, existing approaches plan within either semantic or latent space without explicitly aligning these cross-modal transitions. To address this, we propose CLaD, a framework that models how proprioceptive and semantic states jointly evolve under actions through asymmetric…
Sung-Eui Yoon Team ArXiv
2026-03-31 Long-Reach Robotic Manipulation for Assembly and Outfitting of Lunar Structures Manipulation
Future infrastructure construction on the lunar surface will require semi- or fully-autonomous operation from robots deployed at the build site. In particular, tasks such as electrical outfitting necessitate transport, routing, and fine manipulation of cables across large structures. To address this need, we present a compact and long-reach manipulator incorporating a deployable composite boom,…
Mark Cutkosky Team ArXiv
2026-03-31 Sampling-Horizon Neural Operator Predictors for Nonlinear Control under Delayed Inputs Manipulation
Modern control systems frequently operate under input delays and sampled state measurements. A common delay-compensation strategy is predictor feedback; however, practical implementations require solving an implicit ODE online, resulting in intractable computational cost. Moreover, predictor formulations typically assume continuously available state measurements, whereas in practice measurements…
Yuanyuan Shi Team ArXiv
2026-03-31 HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling Manipulation
World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture that operates on three interconnected principles: object-centric decomposition via slot attention with spatial broadcast decoding, hierarchical temporal…
Osama Jaber Team ArXiv
2026-03-31 Multi-AUV Cooperative Target Tracking Based on Supervised Diffusion-Aided Multi-Agent Reinforcement Learning Manipulation
In recent years, advances in underwater networking and multi-agent reinforcement learning (MARL) have significantly expanded multi-autonomous underwater vehicle (AUV) applications in marine exploration and target tracking. However, current MARL-driven cooperative tracking faces three critical challenges: 1) non-stationarity in decentralized coordination, where local policy updates destabilize…
Chen An Team ArXiv
2026-03-30 ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation Manipulation VLA
Vision-Language-Action (VLA) models and world models have recently emerged as promising paradigms for general-purpose robotic intelligence, yet their progress is hindered by the lack of reliable evaluation protocols that reflect real-world deployment. Existing benchmarks are largely simulator-centric, which provide controllability but fail to capture the reality gap caused by perception noise,…
Xiaodan Liang Team ArXiv
2026-03-30 Feel Robot Feels: Tactile Feedback Array Glove for Dexterous Manipulation Manipulation Dexterous
Teleoperation is a key approach for collecting high-quality, physically consistent demonstrations for robotic manipulation. However, teleoperation for dexterous manipulation remains constrained by: (i) inaccurate hand-robot motion mapping, which limits teleoperated dexterity, and (ii) limited tactile feedback that forces vision-dominated interaction and hinders perception of contact geometry and…
Jiangmiao Pang Team ArXiv
2026-03-30 Tac2Real: Reliable and GPU Visuotactile Simulation for Online Reinforcement Learning and Zero-Shot Real-World Deployment Manipulation Tactile Sim2Real
Visuotactile sensors are indispensable for contact-rich robotic manipulation tasks. However, policy learning with tactile feedback in simulation, especially for online reinforcement learning (RL), remains a critical challenge, as it demands a delicate balance between physics fidelity and computational efficiency. To address this challenge, we present Tac2Real, a lightweight visuotactile…
Jiangmiao Pang Team ArXiv
2026-03-30 LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models Manipulation VLA
Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce…
Dahuin Jung Team ArXiv
2026-03-30 Reducing Oracle Feedback with Vision-Language Embeddings for Preference-Based RL Manipulation
Preference-based reinforcement learning can learn effective reward functions from comparisons, but its scalability is constrained by the high cost of oracle feedback. Lightweight vision-language embedding (VLE) models provide a cheaper alternative, but their noisy outputs limit their effectiveness as standalone reward generators. To address this challenge, we propose ROVED, a hybrid framework…
Amit Roy-Chowdhury Team ArXiv
2026-03-30 Learning Multi-View Spatial Reasoning from Cross-View Relations Manipulation VLA
Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple…
Kimin Lee Team ArXiv
2026-03-30 Tele-Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching Manipulation Dexterous
Teleoperation is a key paradigm for transferring human dexterity to robots, yet most prior work targets objects that are initially static, such as grasping or manipulation. Dynamic object catch, where objects move before contact, remains underexplored. Pure teleoperation in this task often fails due to timing, pose, and force errors, highlighting the need for shared autonomy that combines human…
Kaizhu Huang Team ArXiv
2026-03-30 HandX: Scaling Bimanual Motion and Interaction Generation Manipulation
Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior, finger articulation, contact timing, and inter-hand coordination, and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this…
Liang-Yan Gui Team ArXiv
2026-03-30 Enhancing Policy Learning with World-Action Model Manipulation
This paper presents the World-Action Model (WAM), an action-regularized world model that jointly reasons over future visual observations and the actions that drive state transitions. Unlike conventional world models trained solely via image prediction, WAM incorporates an inverse dynamics objective into DreamerV2 that predicts actions from latent state transitions, encouraging the learned…
Alper Yilmaz Team ArXiv
2026-03-29 Which Reconstruction Model Should a Robot Use? Routing Image-to-3D Models for Cost-Aware Robotic Manipulation Manipulation
Robotic manipulation tasks require 3D mesh reconstructions of varying quality: dexterous manipulation demands fine-grained surface detail, while collision-free planning tolerates coarser representations. Multiple reconstruction methods offer different cost-quality tradeoffs, from Image-to-3D models - whose output quality depends heavily on the input viewpoint - to view-invariant methods such as…
Leslie Pack Kaelbling Team ArXiv
2026-03-29 Spectral Decomposition of Inverse Dynamics for Fast Exploration in Model-Based Manipulation Manipulation
Planning long duration robotic manipulation sequences is challenging because of the complexity of exploring feasible trajectories through nonlinear contact dynamics and many contact modes. Moreover, this complexity grows with the problem’s horizon length. We propose a search tree method that generates trajectories using the spectral decomposition of the inverse dynamics equation. This equation…
Joel Burdick Team ArXiv
2026-03-29 ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation Manipulation VLA
Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named {\textbf \vla}. Our technical…
Yadong Mu Team ArXiv
2026-03-29 Learning Smooth and Robust Space Robotic Manipulation of Dynamic Target via Inter-frame Correlation Manipulation
On-orbit servicing represents a critical frontier in future aerospace engineering, with the manipulation of dynamic non-cooperative targets serving as a key technology. In microgravity environments, objects are typically free-floating, lacking the support and frictional constraints found on Earth, which significantly escalates the complexity of tasks involving space robotic manipulation….
Panfeng Huang Team ArXiv
2026-03-29 Robotic Dexterous Manipulation via Anisotropic Friction Modulation using Passive Rollers Manipulation
Controlling friction at the fingertip is fundamental to dexterous manipulation, yet remains difficult to realize in robotic hands. We present the design and analysis of a robotic fingertip equipped with passive rollers that can be selectively braked or pivoted to modulate contact friction and constraint directions. When unbraked, the rollers permit unconstrained sliding of the contact point along…
Shenli Yuan Team ArXiv
2026-03-29 FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies Manipulation
Thanks to their remarkable flexibility, diffusion models and flow models have emerged as promising candidates for policy representation. However, efficient reinforcement learning (RL) upon these policies remains a challenge due to the lack of explicit log-probabilities for vanilla policy gradient estimators. While numerous attempts have been proposed to address this, the field lacks a unified…
Bo Dai Team ArXiv
2026-03-24 VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs Manipulation VLA HF-Hot
Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially…
Ismini Lourentzou Team ArXiv
2026-03-24 ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment Manipulation HF-Hot
Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates…
Mu Xu Team ArXiv
2026-03-24 Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation Manipulation
While existing equivariant methods enhance data efficiency, they suffer from high computational intensity, reliance on single-modality inputs, and instability when combined with fast-sampling methods. In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. E3Flow overcomes these challenges, successfully unifying efficient…
Shuaicheng Liu Team ArXiv
2026-03-24 Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation Manipulation VLA
Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns – offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA…
Yanchao Yang Team ArXiv
2026-03-24 TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches Manipulation VLA
By integrating Chain-of-Thought(CoT) reasoning, Vision-Language-Action (VLA) models have demonstrated strong capabilities in robotic manipulation, particularly by improving generalization and interpretability. However, the security of CoT-based reasoning mechanisms remains largely unexplored. In this paper, we show that CoT reasoning introduces a novel attack vector for targeted control…
Wenyuan Xu Team ArXiv
2026-03-24 VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents Manipulation
Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. However, LLMs typically lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, we propose Visual-Language…
Yang Yu Team ArXiv
2026-03-24 DiSCo: Diffusion Sequence Copilots for Shared Autonomy Manipulation
Shared autonomy combines human user and AI copilot actions to control complex systems such as robotic arms. When a task is challenging, requires high dimensional control, or is subject to corruption, shared autonomy can significantly increase task performance by using a trained copilot to effectively correct user actions in a manner consistent with the user’s goals. To significantly improve the…
Jonathan C. Kao Team ArXiv / Web
2026-03-24 Non-Adversarial Imitation Learning Provably Free of Compounding Errors: The Role of Bellman Constraints Manipulation
Adversarial imitation learning (AIL) achieves high-quality imitation by mitigating compounding errors in behavioral cloning (BC), but often exhibits training instability due to adversarial optimization. To avoid this issue, a class of non-adversarial Q-based imitation learning (IL) methods, represented by IQ-Learn, has emerged and is widely believed to outperform BC by leveraging online…
Yang Yu Team ArXiv
2026-03-23 IF-CPS: Influence Functions for Cyber-Physical Systems – A Unified Framework for Diagnosis, Curation, and Safety Attribution Manipulation
Neural network controllers trained via behavior cloning are increasingly deployed in cyber-physical systems (CPS), yet practitioners lack tools to trace controller failures back to training data. Existing data attribution methods assume i.i.d.\ data and standard loss targets, ignoring CPS-specific properties: closed-loop dynamics, safety constraints, and temporal trajectory structure. We propose…
Dongmei Chen Team ArXiv
2026-03-23 BiPreManip: Learning Affordance-Based Bimanual Preparatory Manipulation through Anticipatory Collaboration Manipulation
Many everyday objects are difficult to directly grasp (e.g., a flat iPad) or manipulate functionally (e.g., opening the cap of a pen lying on a desk). Such tasks require sequential, asymmetric coordination between two arms, where one arm performs preparatory manipulation that enables the other’s goal-directed action - for instance, pushing the iPad to the table’s edge before picking it up, or…
Hao Dong Team ArXiv
2026-03-23 Trajectory Generation for Underactuated Soft Robot Manipulators using Discrete Elastic Rod Dynamics Manipulation
Soft robots are well suited for contact-rich tasks due to their compliance, yet this property makes accurate and tractable modeling challenging. Planning motions with dynamically-feasible trajectories requires models that capture arbitrary deformations, remain computationally efficient, and are compatible with underactuation. However, existing approaches balance these properties unevenly:…
Andrew P. Sabelhaus Team ArXiv
2026-03-23 CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation Manipulation VLA Sim2Real
“Code-as-Policy” considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet their effectiveness as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents…
Linxi “Jim” Fan Team ArXiv

(<a href=#updated-on-20260406>back to top</a>)

VLA

Publish Date Title & Abstract Authors Links
2026-04-02 Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning Manipulation VLA
Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a…
Dongbin Zhao Team ArXiv
2026-04-02 Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models Manipulation VLA HF-Hot 🔥 HF#19
Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast,…
Zhaoxia Yin Team ArXiv
2026-04-02 Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior Manipulation VLA
In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is, for each state, there exist a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and do not exploit the FAN property,…
Fei Wen Team ArXiv
2026-04-02 AnchorVLA: Anchored Diffusion for Efficient End-to-End Mobile Manipulation Manipulation VLA
A central challenge in mobile manipulation is preserving multiple plausible action models while remaining reactive during execution. A bottle in a cluttered scene can often be approached and grasped in multiple valid ways. Robust behavior depends on preserving this action diversity while remaining reactive as the scene evolves. Diffusion policies are appealing because they model multimodal action…
Yadan Luo Team ArXiv
2026-04-02 UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models VLA
Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation…
Yonglin Tian Team ArXiv
2026-04-02 UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving VLA HF-Hot 🔥 HF#36
Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises:…
Xinggang Wang Team ArXiv
2026-04-02 DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning VLA
Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding-an essential element for embodied systems operating in…
Steven L. Waslander Team ArXiv
2026-04-02 Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving VLA
Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN),…
Manabu Tsukada Team ArXiv
2026-04-01 DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale VLA
End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As…
Jiwen Lu Team ArXiv
2026-04-01 AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction VLA
Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact…
Mathias Unberath Team ArXiv
2026-03-31 CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics Manipulation VLA
Robotic manipulation involves kinematic and semantic transitions that are inherently coupled via underlying actions. However, existing approaches plan within either semantic or latent space without explicitly aligning these cross-modal transitions. To address this, we propose CLaD, a framework that models how proprioceptive and semantic states jointly evolve under actions through asymmetric…
Sung-Eui Yoon Team ArXiv
2026-03-31 DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA VLA
The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM’s potential in high-level decision making and introduces training…
Xihui Liu Team ArXiv
2026-03-30 ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation Manipulation VLA
Vision-Language-Action (VLA) models and world models have recently emerged as promising paradigms for general-purpose robotic intelligence, yet their progress is hindered by the lack of reliable evaluation protocols that reflect real-world deployment. Existing benchmarks are largely simulator-centric, which provide controllability but fail to capture the reality gap caused by perception noise,…
Xiaodan Liang Team ArXiv
2026-03-30 LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models Manipulation VLA
Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce…
Dahuin Jung Team ArXiv
2026-03-30 Learning Multi-View Spatial Reasoning from Cross-View Relations Manipulation VLA
Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple…
Kimin Lee Team ArXiv
2026-03-30 FocusVLA: Focused Visual Utilization for Vision-Language-Action Models VLA Dexterous
Vision-Language-Action (VLA) models improve action generation by conditioning policies on rich vision-language information. However, current auto-regressive policies are constrained by three bottlenecks: (1) architectural bias drives models to overlook visual details, (2) an excessive number of visual tokens makes attention difficult to focus on the correct regions, and (3) task-irrelevant visual…
Jia Wan Team ArXiv
2026-03-30 StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation VLA
Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. However, since different stages of VLA (observation, action generation and execution) must proceed…
Yu Wang Team ArXiv
2026-03-30 A Wide and Deep Exploration of Radio-detected Active Galactic Nuclei with Subaru HSC (WERGS). XIII. High-Redshift Radio Quasar candidates beyond Ultra-Steep Spectrum Selection: Dropout selection from HSC–VLASS over $\sim$1200 deg$^2$ VLA
We report the results of $g-$, $r-$, and $i-$dropout selections based on optical identifications of Very Large Array Sky Survey (VLASS) radio sources using the Hyper Suprime-Cam Subaru Strategic Program survey (HSC–SSP). By positional crossmatching within $1’‘.5$ between the VLASS Epoch~2 catalog and the HSC–SSP Wide-layer catalog ($i \lesssim 26$), we obtain $\sim$400 high-redshift radio AGN…
Victor Kadri Team ArXiv
2026-03-30 CARLA-Air: Fly Drones Inside a CARLA World – A Unified Infrastructure for Air-Ground Embodied Intelligence VLA
The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic…
Hong Zhang Team ArXiv
2026-03-30 Large Dust Grains and a Possible Dust Trap in the Polar Circumbinary Disc of HD 98800B VLA
HD 98800 is a nearby hierarchical quadruple system comprising two binaries orbiting each other. Surprisingly, despite its $\sim$ 10 Myr age and dynamic environment, the Ba-Bb component is surrounded by a compact gas-rich disc in a polar configuration. Previous millimetre continuum observations of this disc found a low millimetre spectral index ($α\sim$ 2.1 up to 9 mm), potentially arising from…
Catherine C. Espaillat Team ArXiv
2026-03-30 Something Bright at the Edge of Everything: A Uniquely JWST-Dark Radio Source in COSMOS VLA
For decades, astronomers have been searching for bright radio sources deep into the epoch of reionization (EoR). The most distant, powerful radio sources are expected to reside in heavily dust-obscured galaxies, exceedingly faint at optical and infrared wavelengths. Motivated by this, I systematically cross-match radio and JWST source catalogs in the COSMOS field and identify a uniquely JWST-dark…
Mingyu Li ArXiv
2026-03-29 ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation Manipulation VLA
Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named {\textbf \vla}. Our technical…
Yadong Mu Team ArXiv
2026-03-26 LILAC: Language-Conditioned Object-Centric Optical Flow for Open-Loop Trajectory Generation VLA
We address language-conditioned robotic manipulation using flow-based trajectory generation, which enables training on human and web videos of object manipulation and requires only minimal embodiment-specific data. This task is challenging, as object trajectory generation from pre-manipulation images and natural language instructions requires appropriate instruction-flow alignment. To tackle this…
Komei Sugiura Team ArXiv
2026-03-25 SOMA: Strategic Orchestration and Memory-Augmented System for Vision-Language-Action Model Robustness via In-Context Adaptation VLA
Despite the promise of Vision-Language-Action (VLA) models as generalist robotic controllers, their robustness against perceptual noise and environmental variations in out-of-distribution (OOD) tasks remains fundamentally limited by the absence of long-term memory, causal failure attribution, and dynamic intervention capability. To address this, we propose SOMA, a Strategic Orchestration and…
Jinyu Gu Team ArXiv
2026-03-24 Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models Dexterous VLA Sim2Real
Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the…
Guiliang Liu Team ArXiv
2026-03-24 VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs Manipulation VLA HF-Hot
Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially…
Ismini Lourentzou Team ArXiv
2026-03-24 Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation Manipulation VLA
Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns – offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA…
Yanchao Yang Team ArXiv
2026-03-24 TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches Manipulation VLA
By integrating Chain-of-Thought(CoT) reasoning, Vision-Language-Action (VLA) models have demonstrated strong capabilities in robotic manipulation, particularly by improving generalization and interpretability. However, the security of CoT-based reasoning mechanisms remains largely unexplored. In this paper, we show that CoT reasoning introduces a novel attack vector for targeted control…
Wenyuan Xu Team ArXiv
2026-03-24 VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models VLA
Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. However, as model capacity and visual context length grow, the inference cost of VLA systems becomes a major bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods mainly rely on semantic saliency or…
Wei Zhang Team ArXiv
2026-03-24 Agile-VLA: Few-Shot Industrial Pose Rectification via Implicit Affordance Anchoring VLA
Deploying Vision-Language-Action (VLA) models on resource-constrained edge platforms encounters a fundamental conflict between high-latency semantic inference and the high-frequency control required for dynamic manipulation. To address the challenge, this paper presents Agile-VLA, a hierarchical framework designed for industrial pose reorientation tasks on edge devices such as the NVIDIA Jetson…
Bingzhuo Zhong Team ArXiv
2026-03-24 CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models VLA
Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive…
Yang Cai Team ArXiv
2026-03-24 CATNAV: Cached Vision-Language Traversability for Efficient Zero-Shot Robot Navigation VLA
Navigating unstructured environments requires assessing traversal risk relative to a robot’s physical capabilities, a challenge that varies across embodiments. We present CATNAV, a cost-aware traversability navigation framework that leverages multimodal LLMs for zero-shot, embodiment-aware costmap generation without task-specific training. We introduce a visuosemantic caching mechanism that…
Girish Chowdhary Team ArXiv
2026-03-24 SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation VLA
Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and high-dimensional continuous actions, making standard imitation learning insufficient. We introduce a framework for learning spatially-grounded VLA models that…
Zhuowen Tu Team ArXiv
2026-03-24 The Two Component Circumgalactic Medium Emission around z~2 Radio-loud Quasars VLA
We present Ly$α$, He II and C IV observations of 7 redshift ~ 2 radio-loud quasars observed using the Keck Cosmic Web Imager (KCWI) and compare it to observed radio jet emission using archival VLA and ALMA radio observations. We detect 80-120 kpc diameter Ly$α$ and 10-40 kpc He II and C IV emission around the targets. We find the Ly$α$ emission to be brighter in the inner 30 kpc by factors of…
Marie Wingyee Lau Team ArXiv
2026-03-23 UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos Dexterous VLA
Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-language-action (VLA) policy and a practical human-data capture setup for universal dexterous hand…
Huazhe Xu Team ArXiv
2026-03-23 ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling VLA
Deploying learned robot manipulation policies in industrial settings requires rigorous pre-deployment validation, yet exhaustive testing across high-dimensional parameter spaces is intractable. We present ROBOGATE, a deployment risk management framework that combines physics-based simulation with a two-stage adaptive sampling strategy to efficiently discover failure boundaries in the operational…
Byungjin Kim ArXiv / Web
2026-03-23 DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models VLA
Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA…
Haoang Li Team ArXiv
2026-03-23 Do World Action Models Generalize Better than VLAs? A Robustness Study VLA
Robot action planning in the real world is challenging as it requires not only understanding the current state of the environment but also predicting how it will evolve in response to actions. Vision-language-action (VLA), which repurpose large-scale vision-language models for robot action generation using action experts, have achieved notable success across a variety of robotic tasks….
Yingxue Zhang Team ArXiv
2026-03-23 VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models VLA
Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to robotic control signals. This “black-box” mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these…
Jiaya Jia Team ArXiv
2026-03-23 AI Token Futures Market: Commoditization of Compute and Derivatives Contract Design VLA
As large language models (LLMs) and vision-language-action models (VLAs) become widely deployed, the tokens consumed by AI inference are evolving into a new type of commodity. This paper systematically analyzes the commodity attributes of tokens, arguing for their transition from intelligent service outputs to compute infrastructure raw materials, and draws comparisons with established…
Yicai Xing ArXiv
2026-03-23 CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation Manipulation VLA Sim2Real
“Code-as-Policy” considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet their effectiveness as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents…
Linxi “Jim” Fan Team ArXiv
2026-03-23 CHANG-ES. XXXVIII. A Thin Radio Halo Shaped by Slow Cosmic-Ray Transport in the Quiescent Galaxy NGC 4565 VLA
We present the VLA C-array S-band (2–4 GHz) radio continuum observations of the nearby edge-on spiral galaxy NGC 4565, a target from the Continuum Halos in Nearby Galaxies - an EVLA (CHANG-ES) Survey. We conduct rotation measure synthesis to probe the magnetic field structure and analyze the vertical radio continuum intensity profiles using the 1-D cosmic ray transportation models. The radio…
Ralf-Jürgen Dettmar Team ArXiv
2026-03-22 RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models VLA HF-Hot
Improving embodied reasoning in multimodal-large-language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them to readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through supervision of vision-question-answering type. However, these approaches have been reported to…
Younggyo Seo Team ArXiv
2026-03-19 Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models VLA
Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M–7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action…
Peng Wang Team ArXiv
2026-03-19 DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding VLA
With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this,…
Jiwen Lu Team ArXiv
2026-03-19 FASTER: Rethinking Real-Time Flow VLAs VLA HF-Hot
Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction…
Hengshuang Zhao Team ArXiv
2026-03-19 From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models VLA
Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of ``efficiency’’ in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In…
Chaojian Li Team ArXiv
2026-03-19 MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model VLA
Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce…
Sung Ju Hwang Team ArXiv
2026-03-19 Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds VLA Sim2Real
The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the…
Wei Xu Team ArXiv
2026-03-19 AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models VLA
Reinforcement learning (RL) for large-scale Vision-Language-Action (VLA) models faces significant challenges in computational efficiency and data acquisition. We propose AcceRL, a fully asynchronous and decoupled RL framework designed to eliminate synchronization barriers by physically isolating training, inference, and rollouts. Crucially, AcceRL is the first to integrate a plug-and-play,…
Yang Liu Team ArXiv

(<a href=#updated-on-20260406>back to top</a>)

Tactile

Publish Date Title & Abstract Authors Links
2026-04-02 Cross-Modal Visuo-Tactile Object Perception Manipulation Tactile
Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always…
Mohsen Kaboli Team ArXiv
2026-04-01 How to Train your Tactile Model: Tactile Perception with Multi-fingered Robot Hands Dexterous Manipulation Tactile
Rapid deployment of new tactile sensors is essential for scalable robotic manipulation, especially in multi-fingered hands equipped with vision-based tactile sensors. However, current methods for inferring contact properties rely heavily on convolutional neural networks (CNNs), which, while effective on known sensors, require large, sensor-specific datasets. Furthermore, they require retraining…
Efi Psomopoulou Team ArXiv
2026-03-30 Tac2Real: Reliable and GPU Visuotactile Simulation for Online Reinforcement Learning and Zero-Shot Real-World Deployment Manipulation Tactile Sim2Real
Visuotactile sensors are indispensable for contact-rich robotic manipulation tasks. However, policy learning with tactile feedback in simulation, especially for online reinforcement learning (RL), remains a critical challenge, as it demands a delicate balance between physics fidelity and computational efficiency. To address this challenge, we present Tac2Real, a lightweight visuotactile…
Jiangmiao Pang Team ArXiv
2026-03-26 When Sensing Varies with Contexts: Context-as-Transform for Tactile Few-Shot Class-Incremental Learning Tactile
Few-Shot Class-Incremental Learning (FSCIL) can be particularly susceptible to acquisition contexts with only a few labeled samples. A typical scenario is tactile sensing, where the acquisition context ({\it e.g.}, diverse devices, contact state, and interaction settings) degrades performance due to a lack of standardization. In this paper, we propose Context-as-Transform FSCIL (CaT-FSCIL) to…
Zheng-Jun Zha Team ArXiv
2026-03-22 Geometrically Plausible Object Pose Refinement using Differentiable Simulation Dexterous Tactile
State-of-the-art object pose estimation methods are prone to generating geometrically infeasible pose hypotheses. This problem is prevalent in dexterous manipulation, where estimated poses often intersect with the robotic hand or are not lying on a support surface. We propose a multi-modal pose refinement approach that combines differentiable physics simulation, differentiable rendering and…
Akansel Cosgun Team ArXiv
2026-03-22 Bayesian Active Object Recognition and 6D Pose Estimation from Multimodal Contact Sensing Tactile
We present an active tactile exploration framework for joint object recognition and 6D pose estimation. The proposed method integrates wrist force/torque sensing, GelSight tactile sensing, and free-space constraints within a Bayesian inference framework that maintains a belief over object class and pose during active tactile exploration. By combining contact and non-contact evidence, the…
Raymond H. Cuijpers Team ArXiv
2026-03-19 Contact Status Recognition and Slip Detection with a Bio-inspired Tactile Hand Tactile
Stable and reliable grasp is critical to robotic manipulations especially for fragile and glazed objects, where the grasp force requires precise control as too large force possibly damages the objects while small force leads to slip and fall-off. Although it is assumed the objects to manipulate is grasped firmly in advance, slip detection and timely prevention are necessary for a robot in…
Longhui Qin Team ArXiv
2026-03-19 ViTac-Tracing: Visual-Tactile Imitation Learning of Deformable Object Tracing Tactile Sim2Real
Deformable objects often appear in unstructured configurations. Tracing deformable objects helps bringing them into extended states and facilitating the downstream manipulation tasks. Due to the requirements for object-specific modeling or sim-to-real transfer, existing tracing methods either lack generalizability across different categories of deformable objects or struggle to complete tasks…
Shan Luo Team ArXiv
2026-03-18 DexViTac: Collecting Human Visuo-Tactile-Kinematic Demonstrations for Contact-Rich Dexterous Manipulation Dexterous Tactile
Large-scale, high-quality multimodal demonstrations are essential for robot learning of contact-rich dexterous manipulation. While human-centric data collection systems lower the barrier to scaling, they struggle to capture the tactile information during physical interactions. Motivated by this, we present DexViTac, a portable, human-centric data collection system tailored for contact-rich…
Xiaotian Ding Team ArXiv
2026-03-16 HapticVLA: Contact-Rich Manipulation via Vision-Language-Action Model without Inference-Time Tactile Sensing Tactile
Tactile sensing is a crucial capability for Vision-Language-Action (VLA) architectures, as it enables dexterous and safe manipulation in contact-rich tasks. However, reliance on dedicated tactile hardware increases cost and reduces reproducibility across robotic platforms. We argue that tactile-aware manipulation can be learned offline and deployed without direct haptic feedback at inference. To…
Dzmitry Tsetserukou Team ArXiv
2026-03-14 GelSphere: An Omnidirectional Rolling Vision-Based Tactile Sensor for Online 3D Reconstruction and Normal Force Estimation Tactile
We present GelSphere, a spherical vision-based tactile sensor designed for real-time continuous surface scanning. Unlike traditional vision-based tactile sensors that can only sense locally and are damaged when slid across surfaces, and cylindrical tactile sensors that can only roll along a fixed direction, our design enables omnidirectional rolling on surfaces. We accomplish this through our…
Wenzhen Yuan Team ArXiv
2026-03-11 FG-CLTP: Fine-Grained Contrastive Language Tactile Pretraining for Robotic Manipulation Tactile
Recent advancements in integrating tactile sensing into vision-language-action (VLA) models have demonstrated transformative potential for robotic perception. However, existing tactile representations predominantly rely on qualitative descriptors (e.g., texture), neglecting quantitative contact states such as force magnitude, contact geometry, and principal axis orientation, which are…
Shuo Wang Team ArXiv
2026-03-11 Learning Bimanual Cloth Manipulation with Vision-based Tactile Sensing via Single Robotic Arm Tactile
Robotic cloth manipulation remains challenging due to the high-dimensional state space of fabrics, their deformable nature, and frequent occlusions that limit vision-based sensing. Although dual-arm systems can mitigate some of these issues, they increase hardware and control complexity. This paper presents Touch G.O.G., a compact vision-based tactile gripper and perception/control framework for…
Petar Kormushev Team ArXiv
2026-03-11 TacLoc: Global Tactile Localization on Objects from a Registration Perspective Tactile
Pose estimation is essential for robotic manipulation, particularly when visual perception is occluded during gripper-object interactions. Existing tactile-based methods generally rely on tactile simulation or pre-trained models, which limits their generalizability and efficiency. In this study, we propose TacLoc, a novel tactile localization framework that formulates the problem as a one-shot…
Huan Yin Team ArXiv
2026-03-10 MuxGel: Simultaneous Dual-Modal Visuo-Tactile Sensing via Spatially Multiplexing and Deep Reconstruction Tactile
High-fidelity visuo-tactile sensing is important for precise robotic manipulation. However, most vision-based tactile sensors face a fundamental trade-off: opaque coatings enable tactile sensing but block pre-contact vision. To address this, we propose MuxGel, a spatially multiplexed sensor that captures both external visual information and contact-induced tactile signals through a single camera….
Yu She Team ArXiv
2026-03-10 NLiPsCalib: An Efficient Calibration Framework for High-Fidelity 3D Reconstruction of Curved Visuotactile Sensors Tactile
Recent advances in visuotactile sensors increasingly employ biomimetic curved surfaces to enhance sensorimotor capabilities. Although such curved visuotactile sensors enable more conformal object contact, their perceptual quality is often degraded by non-uniform illumination, which reduces reconstruction accuracy and typically necessitates calibration. Existing calibration methods commonly rely…
Chenxi Xiao Team ArXiv

(<a href=#updated-on-20260406>back to top</a>)

Sim2Real

Publish Date Title & Abstract Authors Links
2026-04-01 Simulating Realistic LiDAR Data Under Adverse Weather for Autonomous Vehicles: A Physics-Informed Learning Approach Sim2Real
Accurate LiDAR simulation is crucial for autonomous driving, especially under adverse weather conditions. Existing methods struggle to capture the complex interactions between LiDAR signals and atmospheric phenomena, leading to unrealistic representations. This paper presents a physics-informed learning framework (PICWGAN) for generating realistic LiDAR data under adverse weather conditions. By…
Gaurav Pandey Team ArXiv
2026-03-31 Hybrid Framework for Robotic Manipulation: Integrating Reinforcement Learning and Large Language Models Manipulation Sim2Real
This paper introduces a new hybrid framework that combines Reinforcement Learning (RL) and Large Language Models (LLMs) to improve robotic manipulation tasks. By utilizing RL for accurate low-level control and LLMs for high level task planning and understanding of natural language, the proposed framework effectively connects low-level execution with high-level reasoning in robotic systems. This…
Mohd Suhaib Team ArXiv
2026-03-30 Tac2Real: Reliable and GPU Visuotactile Simulation for Online Reinforcement Learning and Zero-Shot Real-World Deployment Manipulation Tactile Sim2Real
Visuotactile sensors are indispensable for contact-rich robotic manipulation tasks. However, policy learning with tactile feedback in simulation, especially for online reinforcement learning (RL), remains a critical challenge, as it demands a delicate balance between physics fidelity and computational efficiency. To address this challenge, we present Tac2Real, a lightweight visuotactile…
Jiangmiao Pang Team ArXiv
2026-03-30 Sim-to-Real Fruit Detection Using Synthetic Data: Quantitative Evaluation and Embedded Deployment with Isaac Sim Sim2Real
This study investigates the effectiveness of synthetic data for sim-to-real transfer in object detection under constrained data conditions and embedded deployment requirements. Synthetic datasets were generated in NVIDIA Isaac Sim and combined with limited real-world fruit images to train YOLO-based detection models under real-only, synthetic-only, and hybrid regimes. Performance was evaluated on…
Martina Hutter-Mironovova ArXiv
2026-03-30 Dynamic Lookahead Distance via Reinforcement Learning-Based Pure Pursuit for Autonomous Racing Sim2Real
Pure Pursuit (PP) is a widely used path-tracking algorithm in autonomous vehicles due to its simplicity and real-time performance. However, its effectiveness is sensitive to the choice of lookahead distance: shorter values improve cornering but can cause instability on straights, while longer values improve smoothness but reduce accuracy in curves. We propose a hybrid control framework that…
Amr S. El-Wakeel Team ArXiv
2026-03-30 A Classification of Heterogeneity in Uncrewed Vehicle Swarms and the Effects of Its Inclusion on Overall Swarm Resilience Sim2Real
Combining different types of agents in uncrewed vehicle (UV) swarms has emerged as an approach to enhance mission resilience and operational capabilities across a wide range of applications. This study offers a systematic framework for grouping different types of swarms based on three main factors: agent nature (behavior and function), hardware structure (physical configuration and sensing…
F. Antonio Medrano Team ArXiv
2026-03-26 Maximum Entropy Behavior Exploration for Sim2Real Zero-Shot Reinforcement Learning Sim2Real
Zero-shot reinforcement learning (RL) algorithms aim to learn a family of policies from a reward-free dataset, and recover optimal policies for any reward function directly at test time. Naturally, the quality of the pretraining dataset determines the performance of the recovered policies across tasks. However, pre-collecting a relevant, diverse dataset without prior knowledge of the downstream…
Stelian Coros Team ArXiv
2026-03-24 Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models Dexterous VLA Sim2Real
Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the…
Guiliang Liu Team ArXiv
2026-03-24 TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation Sim2Real
Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion…
Seungryong Kim Team ArXiv
2026-03-24 RealMaster: Lifting Rendered Scenes into Photorealistic Video Sim2Real HF-Hot
State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet…
Amit Zohar Team ArXiv
2026-03-24 Learning Safe-Stoppability Monitors for Humanoid Robots Sim2Real
Emergency stop (E-stop) mechanisms are the de facto standard for robot safety. However, for humanoid robots, abruptly cutting power can itself cause catastrophic failures; instead, an emergency stop must execute a predefined fallback controller that preserves balance and drives the robot toward a minimum-risk condition. This raises a critical question: from which states can a humanoid robot…
Changliu Liu Team ArXiv
2026-03-23 DexDrummer: In-Hand, Contact-Rich, and Long-Horizon Dexterous Robot Drumming Dexterous Sim2Real
Performing in-hand, contact-rich, and long-horizon dexterous manipulation remains an unsolved challenge in robotics. Prior hand dexterity works have considered each of these three challenges in isolation, yet do not combine these skills into a single, complex task. To further test the capabilities of dexterity, we propose drumming as a testbed for dexterous manipulation. Drumming naturally…
Dorsa Sadigh Team ArXiv
2026-03-23 PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation Sim2Real
Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks…
Hao Zhao Team ArXiv
2026-03-23 RAFL: Generalizable Sim-to-Real of Soft Robots with Residual Acceleration Field Learning Sim2Real
Differentiable simulators enable gradient-based optimization of soft robots over material parameters, control, and morphology, but accurately modeling real systems remains challenging due to the sim-to-real gap. This issue becomes more pronounced when geometry is itself a design variable. System identification reduces discrepancies by fitting global material parameters to data; however, when…
Boyuan Chen Team ArXiv
2026-03-23 Sim-to-Real of Humanoid Locomotion Policies via Joint Torque Space Perturbation Injection Sim2Real
This paper proposes a novel alternative to existing sim-to-real methods for training control policies with simulated experiences. Unlike prior methods that typically rely on domain randomization over a fixed finite set of parameters, the proposed approach injects state-dependent perturbations into the input joint torque during forward simulation. These perturbations are designed to simulate a…
Jaeheung Park Team ArXiv
2026-03-23 CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation Manipulation VLA Sim2Real
“Code-as-Policy” considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet their effectiveness as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents…
Linxi “Jim” Fan Team ArXiv
2026-03-22 Evaluating Factor-Wise Auxiliary Dynamics Supervision for Latent Structure and Robustness in Simulated Humanoid Locomotion Sim2Real
We evaluate whether factor-wise auxiliary dynamics supervision produces useful latent structure or improved robustness in simulated humanoid locomotion. DynaMITE – a transformer encoder with a factored 24-d latent trained by per-factor auxiliary losses during proximal policy optimization (PPO) – is compared against Long Short-Term Memory (LSTM), plain Transformer, and Multilayer Perceptron…
Chayanin Chamachot ArXiv
2026-03-19 Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds VLA Sim2Real
The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the…
Wei Xu Team ArXiv
2026-03-19 ViTac-Tracing: Visual-Tactile Imitation Learning of Deformable Object Tracing Tactile Sim2Real
Deformable objects often appear in unstructured configurations. Tracing deformable objects helps bringing them into extended states and facilitating the downstream manipulation tasks. Due to the requirements for object-specific modeling or sim-to-real transfer, existing tracing methods either lack generalizability across different categories of deformable objects or struggle to complete tasks…
Shan Luo Team ArXiv
2026-03-19 Introducing M: A Modular, Modifiable Social Robot Sim2Real
We present M, an open-source, low-cost social robot platform designed to reduce platform friction that slows social robotics research by making robots easier to reproduce, modify, and deploy in real-world settings. M combines a modular mechanical design, multimodal sensing, and expressive yet mechanically simple actuation architecture with a ROS2-native software package that cleanly separates…
Chien-Ming Huang Team ArXiv
2026-03-19 Articulated-Body Dynamics Network: Dynamics-Grounded Prior for Robot Learning Sim2Real
Recent work in reinforcement learning has shown that incorporating structural priors for articulated robots, such as link connectivity, into policy networks improves learning efficiency. However, dynamics properties, despite their fundamental role in determining how forces and motion propagate through the body, remain largely underexplored as an inductive bias for policy learning. To address this…
Josiah Hanna Team ArXiv
2026-03-19 V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors Sim2Real
Training generalist robots demands large-scale, diverse manipulation data, yet real-world collection is prohibitively expensive, and existing simulators are often constrained by fixed asset libraries and manual heuristics. To bridge this gap, we present V-Dreamer, a fully automated framework that generates open-vocabulary, simulation-ready manipulation environments and executable expert…
Yang Gao Team ArXiv
2026-03-19 Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer Sim2Real
Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology- Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents…
Andreas Bulling Team ArXiv
2026-03-19 PRIOR: Perceptive Learning for Humanoid Locomotion with Reference Gait Priors Sim2Real
Training perceptive humanoid locomotion policies that traverse complex terrains with natural gaits remains an open challenge, typically demanding multi-stage training pipelines, adversarial objectives, or extensive real-world calibration. We present PRIOR, an efficient and reproducible framework built on Isaac Lab that achieves robust terrain traversal with human-like gaits through a simple yet…
Houde Liu Team ArXiv
2026-03-18 From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving Sim2Real
Autonomous driving technologies have achieved significant advances in recent years, yet their real-world deployment remains constrained by data scarcity, safety requirements, and the need for generalization across diverse environments. In response, synthetic data and virtual environments have emerged as powerful enablers, offering scalable, controllable, and richly annotated scenarios for…
A. Behera Team ArXiv
2026-03-17 MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation HF-Hot Sim2Real
A prevailing view in robot learning is that simulation alone is not enough; effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated synthetic training data, we show that zero-shot…
Ranjay Krishna Team ArXiv
2026-03-17 Fast and Reliable Gradients for Deformables Across Frictional Contact Regimes Dexterous Sim2Real
Differentiable simulation establishes the mathematical foundation for solving challenging inverse problems in computer graphics and robotics, such as physical system identification and inverse dynamics control. However, rigor in frictional contact remains the “elephant in the room.” Current frameworks often avoid contact singularities via non-Markovian position approximations or heuristic…
Fan Shi Team ArXiv
2026-03-17 Efficient and Reliable Teleoperation through Real-to-Sim-to-Real Shared Autonomy Sim2Real
Fine-grained, contact-rich teleoperation remains slow, error-prone, and unreliable in real-world manipulation tasks, even for experienced operators. Shared autonomy offers a promising way to improve performance by combining human intent with automated assistance, but learning effective assistance in simulation requires a faithful model of human behavior, which is difficult to obtain in practice….
Yunzhu Li Team ArXiv
2026-03-17 SLowRL: Safe Low-Rank Adaptation Reinforcement Learning for Locomotion Sim2Real
Sim-to-real transfer of locomotion policies often leads to performance degradation due to the inevitable sim-to-real gap. Naively fine-tuning these policies directly on hardware is problematic, as it poses risks of mechanical failure and suffers from high sample inefficiency. In this paper, we address the challenge of safely and efficiently fine-tuning reinforcement learning (RL) policies for…
Hsiu-Chin Lin Team ArXiv
2026-03-17 Learning Whole-Body Control for a Salamander Robot Sim2Real LearnedControl
Amphibious legged robots inspired by salamanders are promising in applications in complex amphibious environments. However, despite the significant success of training controllers that achieve diverse locomotion behaviors in conventional quadrupedal robots, most salamander robots relied on central-pattern-generator (CPG)-based and model-based coordination strategies for locomotion control….
Auke Ijspeert Team ArXiv
2026-03-17 ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control Sim2Real LearnedControl
We present ECHO, an edge–cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion…
Yutao Yue Team ArXiv
2026-03-16 HALO:Closing Sim-to-Real Gap for Heavy-loaded Humanoid Agile Motion Skills via Differentiable Simulation Sim2Real
Humanoid robots deployed in real-world scenarios often need to carry unknown payloads, which introduce significant mismatch and degrade the effectiveness of simulation-to-reality reinforcement learning methods. To address this challenge, we propose a two-stage gradient-based system identification framework built on the differentiable simulator MuJoCo XLA. The first stage calibrates the nominal…
Shiqiang Zhu Team ArXiv
2026-03-16 CycleRL: Sim-to-Real Deep Reinforcement Learning for Robust Autonomous Bicycle Control Sim2Real
Autonomous bicycles offer a promising agile solution for urban mobility and last-mile logistics, however, conventional control strategies often struggle with their underactuated nonlinear dynamics, suffering from sensitivity to model mismatches and limited adaptability to real-world uncertainties. To address this, this paper presents CycleRL, the first sim-to-real deep reinforcement learning…
Xiangwei Zhu Team ArXiv
2026-03-16 Emergent Dexterity via Diverse Resets and Large-Scale Reinforcement Learning Dexterous Sim2Real
Reinforcement learning in massively parallel physics simulations has driven major progress in sim-to-real robot learning. However, current approaches remain brittle and task-specific, relying on extensive per-task engineering to design rewards, curricula, and demonstrations. Even with this engineering, they often fail on long-horizon, contact-rich manipulation tasks and do not meaningfully scale…
Abhishek Gupta Team ArXiv
2026-03-16 ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors Sim2Real
Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to…
Karl Schmeckpeper Team ArXiv
2026-03-16 Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation Sim2Real
Simulation-to-real transfer remains a central challenge in robotics, as mismatches between simulated and real-world dynamics often lead to failures. While reinforcement learning offers a principled mechanism for adaptation, existing sim-to-real finetuning methods struggle with exploration and long-horizon credit assignment in the low-data regimes typical of real-world robotics. We introduce…
David Fridovich-Keil Team ArXiv
2026-03-15 On Globally Optimal Stochastic Policy Gradient Methods for Domain Randomized LQR Synthesis Sim2Real
Domain randomization is a simple, effective, and flexible scheme for obtaining robust feedback policies aimed at reducing the sim-to-real gap due to model mismatch. While domain randomization methods have yielded impressive demonstrations in the robotics-learning literature, general and theoretically motivated principles for designing optimization schemes that effectively leverage the…
Nikolai Matni Team ArXiv
2026-03-14 TransCurriculum: Multi-Dimensional Curriculum Learning for Fast & Stable Locomotion Sim2Real
High-speed legged locomotion struggles with stability and transfer losses at higher command velocities during deployment. One reason is that most curricula vary difficulty along single axis, for example increase the range of command velocities, terrain difficulty, or domain parameters (e.g. friction or payload mass) using either fixed update rule or instantaneous rewards while ignoring how the…
Dinesh Manocha Team ArXiv
2026-03-14 Robust Sim-to-Real Cloth Untangling through Reduced-Resolution Observations via Adaptive Force-Difference Quantization Sim2Real
Robotic cloth untangling requires progressively disentangling fabric by adapting pulling actions to changing contact and tension conditions. Because large-scale real-world training is impractical due to cloth damage and hardware wear, sim-to-real policy transfer is a promising solution. However, cloth manipulation is highly sensitive to interaction dynamics, and policies that depend on precise…
Takamitsu Matsubara Team ArXiv
2026-03-13 Learning Actionable Manipulation Recovery via Counterfactual Failure Synthesis Sim2Real
While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations, which introduce a severe sim-to-real gap. Furthermore, existing visual analyzers predominantly output coarse,…
Zhiwen Fan Team ArXiv
2026-03-13 SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation Sim2Real
A surgical world model capable of generating realistic surgical action videos with precise control over tool-tissue interactions can address fundamental challenges in surgical AI and simulation – from data scarcity and rare event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require…
Mathias Unberath Team ArXiv
2026-03-13 FLUX: Accelerating Cross-Embodiment Generative Navigation Policies via Rectified Flow and Static-to-Dynamic Learning Sim2Real
Autonomous navigation requires a broad spectrum of skills, from static goal-reaching to dynamic social traversal, yet evaluation remains fragmented across disparate protocols. We introduce DynBench, a dynamic navigation benchmark featuring physically valid crowd simulation. Combined with existing static protocols, it supports comprehensive evaluation across six fundamental navigation tasks….
Junwei Liang Team ArXiv
2026-03-13 Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data Sim2Real
Human athletes demonstrate versatile and highly-dynamic tennis skills to successfully conduct competitive rallies with a high-speed tennis ball. However, reproducing such behaviors on humanoid robots is difficult, partially due to the lack of perfect humanoid action data or human kinematic motion data in tennis scenarios as reference. In this work, we propose LATENT, a system that Learns Athletic…
Li Yi Team ArXiv
2026-03-13 CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning Sim2Real
Distributed reinforcement learning policies face network delays, jitter, and packet loss when deployed across edge devices and cloud servers. Standard RL training assumes zero-latency interaction, causing severe performance degradation under realistic network conditions. We introduce CALF (Communication-Aware Learning Framework), which trains policies under realistic network models during…
Pietro Lio’ Team ArXiv
2026-03-12 ComFree-Sim: A GPU-Parallelized Analytical Contact Physics Engine for Scalable Contact-Rich Robotics Simulation and Control Dexterous Sim2Real
Physics simulation for contact-rich robotics is often bottlenecked by contact resolution: mainstream engines enforce non-penetration and Coulomb friction via complementarity constraints or constrained optimization, requiring per-step iterative solves whose cost grows superlinearly with contact density. We present ComFree-Sim, a GPU-parallelized analytical contact physics engine built on…
Wanxin Jin Team ArXiv
2026-03-12 Sim-to-reality adaptation for Deep Reinforcement Learning applied to an underwater docking application Sim2Real
Deep Reinforcement Learning (DRL) offers a robust alternative to traditional control methods for autonomous underwater docking, particularly in adapting to unpredictable environmental conditions. However, bridging the “sim-to-real” gap and managing high training latencies remain significant bottlenecks for practical deployment. This paper presents a systematic approach for autonomous docking…
Pere Ridao Team ArXiv
2026-03-12 A Hybrid Neural-Assisted Unscented Kalman Filter for Unmanned Ground Vehicle Navigation Sim2Real
Modern autonomous navigation for unmanned ground vehicles relies on different estimators to fuse inertial sensors and GNSS measurements. However, the constant noise covariance matrices often struggle to account for dynamic real-world conditions. In this work we propose a hybrid estimation framework that bridges classical state estimation foundations with modern deep learning approaches. Instead…
Itzik Klein Team ArXiv
2026-03-11 ASTER: Attitude-aware Suspended-payload Quadrotor Traversal via Efficient Reinforcement Learning Sim2Real
Agile maneuvering of the quadrotor cable-suspended system is significantly hindered by its non-smooth hybrid dynamics. While model-free Reinforcement Learning (RL) circumvents explicit differentiation of complex models, achieving attitude-constrained or inverted flight remains an open challenge due to the extreme reward sparsity under strict orientation requirements. This paper presents ASTER, a…
Shuo Li Team ArXiv
2026-03-11 MAVEN: A Meta-Reinforcement Learning Framework for Varying-Dynamics Expertise in Agile Quadrotor Maneuvers Sim2Real
Reinforcement learning (RL) has emerged as a powerful paradigm for achieving online agile navigation with quadrotors. Despite this success, policies trained via standard RL typically fail to generalize across significant dynamic variations, exhibiting a critical lack of adaptability. This work introduces MAVEN, a meta-RL framework that enables a single policy to achieve robust end-to-end…
Shuo Li Team ArXiv
2026-03-11 SteadyTray: Learning Object Balancing Tasks in Humanoid Tray Transport via Residual Reinforcement Learning Sim2Real
Stabilizing unsecured payloads against the inherent oscillations of dynamic bipedal locomotion remains a critical engineering bottleneck for humanoids in unstructured environments. To solve this, we introduce ReST-RL, a hierarchical reinforcement learning architecture that explicitly decouples locomotion from payload stabilization, evaluated via the SteadyTray benchmark. Rather than relying on…
Michael Yip Team ArXiv

(<a href=#updated-on-20260406>back to top</a>)

LearnedControl

Publish Date Title & Abstract Authors Links
2026-04-02 MorphoGuard: A Morphology-Based Whole-Body Interactive Motion Controller LearnedControl
Whole-body control (WBC) has demonstrated significant advantages in complex interactive movements of high-dimensional robotic systems. However, when a robot is required to handle dynamic multi-contact combinations along a single kinematic chain-such as pushing open a door with its elbow while grasping an object-it faces major obstacles in terms of complex contact representation and joint…
Bin He Team ArXiv
2026-04-01 SMASH: Mastering Scalable Whole-Body Skills for Humanoid Ping-Pong with Egocentric Vision LearnedControl
Existing humanoid table tennis systems remain limited by their reliance on external sensing and their inability to achieve agile whole-body coordination for precise task execution. These limitations stem from two core challenges: achieving low-latency and robust onboard egocentric perception under fast robot motion, and obtaining sufficiently diverse task-aligned strike motions for learning…
Ping Luo Team ArXiv
2026-04-01 BAT: Balancing Agility and Stability via Online Policy Switching for Long-Horizon Whole-Body Humanoid Control LearnedControl
Despite recent advances in control, reinforcement learning, and imitation learning, developing a unified framework that can achieve agile, precise, and robust whole-body behaviors, particularly in long-horizon tasks, remains challenging. Existing approaches typically follow two paradigms: coupled whole-body policies for global coordination and decoupled policies for modular precision. However,…
Sehoon Ha Team ArXiv
2026-03-31 DreamControl-v2: Simpler and Scalable Autonomous Humanoid Skills via Trainable Guided Diffusion Priors LearnedControl
Developing robust autonomous loco-manipulation skills for humanoids remains an open problem in robotics. While RL has been applied successfully to legged locomotion, applying it to complex, interaction-rich manipulation tasks is harder given long-horizon planning challenges for manipulation. A recent approach along these lines is DreamControl, which addresses these issues by leveraging…
Jonathan Chung-Kuan Huang Team ArXiv
2026-03-30 Active Stereo-Camera Outperforms Multi-Sensor Setup in ACT Imitation Learning for Humanoid Manipulation LearnedControl
The complexity of teaching humanoid robots new tasks is one of the major reasons hindering their widespread adoption in the industry. While Imitation Learning (IL), particularly Action Chunking with Transformers (ACT), enables rapid task acquisition, there is no consensus yet on the optimal sensory hardware required for manipulation tasks. This paper benchmarks 14 sensor combinations on the…
Dennis Bank Team ArXiv
2026-03-25 SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating LearnedControl
Recent advances in real-time interactive text-driven motion generation have enabled humanoids to perform diverse behaviors. However, kinematics-only generators often exhibit physical hallucinations, producing motion trajectories that are physically infeasible to track with a downstream motion tracking controller or unsafe for real-world deployment. These failures often arise from the lack of…
Donghan Koo Team ArXiv
2026-03-23 Make Tracking Easy: Neural Motion Retargeting for Humanoid Whole-body Control LearnedControl
Humanoid robots require diverse motor skills to integrate into complex environments, but bridging the kinematic and dynamic embodiment gap from human data remains a major bottleneck. We demonstrate through Hessian analysis that traditional optimization-based retargeting is inherently non-convex and prone to local optima, leading to physical artifacts like joint jumps and self-penetration. To…
Xun Cao Team ArXiv
2026-03-19 ADMM-Based Distributed MPC with Control Barrier Functions for Safe Multi-Robot Quadrupedal Locomotion LearnedControl
This paper proposes a fully decentralized model predictive control (MPC) framework with control barrier function (CBF) constraints for safety-critical trajectory planning in multi-robot legged systems. The incorporation of CBF constraints introduces explicit inter-agent coupling, which prevents direct decomposition of the resulting optimal control problems. To address this challenge, we…
Kaveh Akbari Hamed Team ArXiv
2026-03-17 Learning Whole-Body Control for a Salamander Robot Sim2Real LearnedControl
Amphibious legged robots inspired by salamanders are promising in applications in complex amphibious environments. However, despite the significant success of training controllers that achieve diverse locomotion behaviors in conventional quadrupedal robots, most salamander robots relied on central-pattern-generator (CPG)-based and model-based coordination strategies for locomotion control….
Auke Ijspeert Team ArXiv
2026-03-17 ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control Sim2Real LearnedControl
We present ECHO, an edge–cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion…
Yutao Yue Team ArXiv
2026-03-15 CyboRacket: A Perception-to-Action Framework for Humanoid Racket Sports LearnedControl
Dynamic ball-interaction tasks remain challenging for robots because they require tight perception-action coupling under limited reaction time. This challenge is especially pronounced in humanoid racket sports, where successful interception depends on accurate visual tracking, trajectory prediction, coordinated stepping, and stable whole-body striking. Existing robotic racket-sport systems often…
Kai Chen Team ArXiv
2026-03-15 Load-Aware Locomotion Control for Humanoid Robots in Industrial Transportation Tasks LearnedControl
Humanoid robots deployed in industrial environments are required to perform load-carrying transportation tasks that tightly couple locomotion and manipulation. However, achieving stable and robust locomotion under varying payloads and upper-body motions is challenging due to dynamic coupling and partial observability. This paper presents a load-aware locomotion framework for industrial humanoids…
Shiqi Li Team ArXiv
2026-03-14 REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning LearnedControl
Humanoid loco-manipulation requires coordinated high-level motion plans with stable, low-level whole-body execution under complex robot-environment dynamics and long-horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low-level controller,…
Ye Zhao Team ArXiv
2026-03-13 PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization LearnedControl
Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models for character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. While WBC…
Ivan Laptev Team ArXiv
2026-03-12 $Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation LearnedControl
We introduce $Ψ_0$ (Psi-Zero), an open foundation model to address challenging humanoid loco-manipulation tasks. While existing approaches often attempt to address this fundamental problem by co-training on large and diverse human and humanoid data, we argue that this strategy is suboptimal due to the fundamental kinematic and motion disparities between humans and humanoid robots. Therefore, data…
Yue Wang Team ArXiv
2026-03-11 Cybo-Waiter: A Physical Agentic Framework for Humanoid Whole-Body Locomotion-Manipulation LearnedControl
Robots are increasingly expected to execute open ended natural language requests in human environments, which demands reliable long horizon execution under partial observability. This is especially challenging for humanoids because locomotion and manipulation are tightly coupled through stance, reachability, and balance. We present a humanoid agent framework that turns VLM plans into verifiable…
Kai Chen Team ArXiv
2026-03-10 ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video LearnedControl
Achieving versatile and naturalistic whole-body control for humanoid robot scene-interaction remains a significant challenge. While some recent works have demonstrated autonomous humanoid interactive control, they are constrained to rigid locomotion patterns and expensive teleoperation data collection, lacking the versatility to execute more human-like natural behaviors such as sitting or…
Xuelong Li Team ArXiv

(<a href=#updated-on-20260406>back to top</a>)