VLA

Updated on 2026.06.11

Usage instructions: here

Publish Date	Title & Abstract	Authors	Links
2026-06-09	Task Robustness via Re-Labelling Vision-Action Robot Data `Dexterous` `Manipulation` `VLA` The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. However, these policies continue to struggle with following instructions, likely due to the limited linguistic and action sequence diversity in existing robotics datasets. This paper introduces Task Robustness via Re-Labelling…	Glen Berseth Team	ArXiv / Web
2026-06-09	LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination `Dexterous` `VLA` Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study scene-induced occlusion as a fundamental challenge for VLA models and introduce…	Zhongyu Wei Team	ArXiv
2026-06-09	Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations `Dexterous` `Manipulation` `VLA` Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous…	Jinwoo Shin Team	ArXiv
2026-06-09	VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models `Dexterous` `Manipulation` `VLA` Vision-language-action (VLA) models have shown strong promise for robotic manipulation, but their reliability at test time remains limited by one-shot action prediction, where even small action errors can cause grasp failure, collision, or incorrect task progression. A natural alternative is to equip VLA systems with test-time verification, allowing multiple candidate actions to be proposed and…	Jing Liu Team	ArXiv
2026-06-09	AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness `VLA` Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions. This severely limits the model’s action space and fails to leverage depth inputs effectively. Moreover, memory is commonly handled by accumulating…	Tong Qin Team	ArXiv
2026-06-09	Uncovering Vulnerability of Vision-Language-Action Models under Joint-Level Physical Faults `VLA` Deploying Vision-Language-Action (VLA) models in real robotic systems requires robustness not only to semantic and perceptual variations, but also to embodiment-side faults that change how actions are physically realized. Real robots can experience joint-level changes caused by actuator degradation, hardware faults, safety limits, collision damage, or wear-induced friction. These faults are…	Taesup Kim Team	ArXiv
2026-06-09	Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models `VLA` Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address…	Dinesh Manocha Team	ArXiv
2026-06-09	A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation `VLA` `Sim2Real` Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation….	Yang Gao Team	ArXiv
2026-06-09	Rethinking Embodied Navigation via Relational Inductive Bias `VLA` Object navigation requires an agent to locate a target in an unknown environment through visual observations. Existing methods typically rely on open-vocabulary detectors or vision-language models (VLMs) to answer where to search, but often overlook what not to trust - which semantic cues are unreliable. Open-vocabulary perception is prone to systematic misleading evidence: false positives,…	Cheng Deng Team	ArXiv
2026-06-09	SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation `VLA` Fine-tuning vision-language-action (VLA) policies for long-horizon manipulation still relies heavily on behavior cloning, which requires costly high-quality demonstrations and keeps policies near the demonstration distribution. Reward models can reduce this dependence by reweighting demonstrations and providing dense supervision for on-robot reinforcement learning (RL), but they must be dense,…	Mac Schwager Team	ArXiv
2026-06-09	What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents `VLA` Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect…	Annie Xie Team	ArXiv
2026-06-08	MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models `Dexterous` `Manipulation` `VLA` Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal…	Gao Huang Team	ArXiv / Web
2026-06-08	Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models `Dexterous` `Manipulation` `VLA` Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene. Existing safety filters sidestep this problem by querying a vision-language model (VLM) to identify obstacles and their locations. This, however, is too slow…	Nader Sehatbakhsh Team	ArXiv
2026-06-08	ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models `Dexterous` `Manipulation` `VLA` Vision-Language-Action (VLA) models demonstrate strong performance on language-conditioned robotic manipulation within their training distribution, yet their generalization capabilities remain fundamentally limited. They lack the robustness required to handle perturbations, frequently failing when confronted with lighting changes, altered camera viewpoints, or small initial-state…	Nader Sehatbakhsh Team	ArXiv
2026-06-08	ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies `Dexterous` `Manipulation` `VLA` `Sim2Real` Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA – a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a…	Toshiaki Koike-Akino Team	ArXiv
2026-06-08	CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control `Dexterous` `Manipulation` `VLA` Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact…	Jiahu Qin Team	ArXiv
2026-06-08	Targeting World Models to Compromise Robot Learning Pipelines `Manipulation` `VLA` World models have recently seen a rapid growth in both their popularity and capability as more data efficient tools for generating robot training data or simulating real world environments, with many works proposing their integration into the robot learning pipeline. While highly practical, in this work we demonstrate that world models introduce a uniquely stealthy and effective data poisoning…	Eugene Bagdasarian Team	ArXiv
2026-06-08	OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics `VLA` Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We…	Xiaojuan Qi Team	ArXiv
2026-06-08	SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks `VLA` Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for…	Yinpeng Dong Team	ArXiv
2026-06-08	Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer `VLA` Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages…	Kyung-Joon Park Team	ArXiv
2026-06-08	RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour `VLA` We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way a person browses. A human reading a page does not retain every pixel they have seen; they look at a few candidate targets, decide on one, and remember only what is needed to keep the goal alive. We operationalize this perception-cognition-action triad as…	Sawradip Saha Team	ArXiv
2026-06-08	TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation `VLA` Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as…	Baoxu Liu Team	ArXiv
2026-06-08	Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection `VLA` Vision-language-action (VLA) policies can deviate from nominal trajectories during manipulation, even when tasks remain physically feasible. Recovering from these deviations is challenging, as they push the policy into unfamiliar state spaces where direct re-planning frequently destabilizes action sequences. We propose Back to the Familiar Future (B2FF), a recovery framework for foresight-driven…	Byoung-Tak Zhang Team	ArXiv
2026-06-08	MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation `VLA` `LearnedControl` World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on tabletop manipulation, but iterative denoising over high-dimensional video-action latents leaves them too slow for real-time humanoid loco-manipulation. The problem is compounded by the dominant hierarchical paradigm, in which a high-level manipulation policy controls only the upper body…	Junwei Liang Team	ArXiv
2026-06-08	Scaling by Diversified Experience for Vision-Language-Action Models `VLA` Vision-Language-Action models face significant challenges in real-world deployment due to the entanglement of high-level reasoning with low-level control, and the instability of policy optimization. In this paper, we introduce SyVLA, a robust VLA model trained with diversified experiences. We propose an Intention Decoupling algorithm to isolate control-relevant features from reasoning contexts…	Nanyang Ye Team	ArXiv
2026-06-08	C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache `VLA` World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires…	Yuzhang Shang Team	ArXiv
2026-06-08	YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale `VLA` We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for…	Kei Ota Team	ArXiv / Web
2026-06-08	Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs `VLA` We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through generic inputs, such as a keyboard. This method can be used out-of-the-box and does not require retraining or fine-tuning VLAs. It enables relatively crude user inputs to steer a VLA to align with user intent. The VLA transforms these inputs into action samples…	Andy Wang Team	ArXiv
2026-06-07	Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis `VLA` Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the…	Xinchuan Qiu Team	ArXiv
2026-06-07	BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving `VLA` We present BLUE, a minimal method for better language use in vision-language-action (VLA) models for autonomous driving (AD). Through extensive analysis, we reveal that language matters on only a small fraction of routes, but on those routes it can greatly improve or degrade performance. Generating language at every frame is therefore inefficient, since most computation is spent on frames that do…	Zhongzhan Huang Team	ArXiv
2026-06-07	Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language `VLA` Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial information (“I left my backpack on the table”) that reference parts of the world beyond their perceptual field of view. Traditional metric-semantic mapping ignores this signal, while off-the-shelf multimodal models remain limited in 3D spatial reasoning and are not directly amenable to fusion…	Andreea Bobu Team	ArXiv
2026-06-07	FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning `VLA` Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves…	Jiahui Du Team	ArXiv / Web
2026-06-07	Facial Expression Recognition in the Deep Learning Era: A Systematic Multi-Criteria Review of Methods, Models, Datasets, Performance, Challenges, and Future Research Directions `VLA` Facial Expression Recognition (FER) has advanced rapidly over the last decade, driven by the shift from handcrafted descriptors and shallow classifiers to deep convolutional, attention-based, vision-language, and foundation-model architectures, and by the parallel growth of large-scale in-the-wild benchmarks spanning categorical, dimensional, compound, micro-expression, Action Unit (AU), and…	Georgios Th. Papadopoulos Team	ArXiv
2026-06-05	Spline Policy: A Structured Representation for Robot Policies `Dexterous` `Manipulation` `VLA` Modern imitation-learning policies for robot manipulation often represent actions as fixed-resolution action chunks, which are simple and effective but expose limited geometric and temporal structure before execution. This paper studies Spline Policy (SP), a structured representation that replaces action chunks with spline parameters while keeping the policy backbone unchanged. The predicted…	Sylvain Calinon Team	ArXiv
2026-06-05	RhinoVLA Technical Report `Dexterous` `Manipulation` `VLA` Vision-Language-Action (VLA) models have shown strong potential for robotic manipulation, but real-time deployment on edge hardware remains challenging. In this work, we identify VLM visual and context tokens as a major source of deployment latency: for GEMM-dominated projection operators, computation grows linearly with the number of input tokens when model dimensions are fixed. Motivated by…	Yuxi Liu Team	ArXiv
2026-06-05	Robotic Policy Adaptation via Weight-Space Meta-Learning `Dexterous` `Manipulation` `VLA` `HF-Hot` Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We…	Luca Franco Team	ArXiv
2026-06-05	Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models `Dexterous` `VLA` Most vision-language-action (VLA) models map observations directly to actions without explicit intermediate planning, which limits performance on long-horizon tasks where early mistakes compound. We propose Coarse-to-Control, a plan-execute VLA that introduces planning natively in the action-token space. The key idea is to let the policy first predict a compact sequence of coarse action tokens…	Yu-Gang Jiang Team	ArXiv
2026-06-05	LARA: Latent Action Representation Alignment for Vision-Language-Action Models `Dexterous` `Manipulation` `VLA` Visual-language action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAM) learn latent action representations from…	Siyuan Huang Team	ArXiv
2026-06-05	MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism `VLA` `HF-Hot` Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a…	Chunhua Shen Team	ArXiv
2026-06-05	ActionMap: Robot Policy Learning via Voxel Action Heatmap `VLA` Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone’s hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising,…	Mike Zheng Shou Team	ArXiv
2026-06-05	Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation `VLA` Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce…	Si Liu Team	ArXiv
2026-06-04	TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies `Dexterous` `Manipulation` `VLA` Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from…	Mingyu Ding Team	ArXiv
2026-06-04	AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding `Dexterous` `Manipulation` `VLA` Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception–action mappings. To address this challenge, we propose AffordanceVLA, a unified…	Yingcong Chen Team	ArXiv / Web
2026-06-04	Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators `VLA` While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we…	Xihui Liu Team	ArXiv / Web
2026-06-04	MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action `VLA` Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes $M$…	Lianlei Shan Team	ArXiv
2026-06-04	WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation `VLA` End-to-end Vision-Language-Action (VLA) models have shown promise in UAV navigation. However, existing approaches typically rely on historical observations to directly predict actions, often struggling in dense urban environments where severe occlusions and sharp turns result in drastic viewpoint transitions. We argue that the ability to “imagine” future states – inherent in World Models – is…	Xiao-Ping Zhang Team	ArXiv
2026-06-04	Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation `VLA` While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and…	Xiu Li Team	ArXiv
2026-06-04	A Conversational Framework for Human-Robot Collaborative Manipulation with Distributed Generative AI models `VLA` This paper presents a distributed conversational framework for human-robot collaborative manipulation that integrates local language and vision-language models (VLMs) with a Robot Operating System 2 (ROS 2)-based execution stack. Language understanding, visual grounding, orchestration, and motion execution run as separate ROS 2 nodes, enabling flexible deployment across distributed hardware while…	Roel Pieters Team	ArXiv
2026-06-04	World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis `VLA` We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities…	Zhijie Deng Team	ArXiv
2026-06-04	T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation `VLA` Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them…	Reza Sabzevari Team	ArXiv
2026-06-04	PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation `VLA` Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA…	Hanli Wang Team	ArXiv
2026-06-04	DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models `VLA` Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this…	Yin Li Team	ArXiv
2026-06-04	Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models `VLA` Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action…	Xipeng Qiu Team	ArXiv
2026-06-04	VLA Observations Confirm AT 2023mfm as an Off-nuclear Tidal Disruption Event `VLA` We report new radio observations of the tidal disruption event (TDE) AT 2023mfm, which we identified as a high-confidence candidate in a systematic search for off-nuclear TDEs. High-resolution NSF Karl G. Jansky Very Large Array C-band (6 GHz) imaging resolves two radio sources: one consistent with the host-galaxy nucleus and one offset by $0.651\pm0.036^{\prime\prime}$ ($1.06\pm0.06$ kpc),…	Jimmy Lynch Team	ArXiv
2026-06-04	Robots Need More than VLA and World Models `VLA` Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world’s abundant unstructured…	Haitham Bou-Ammar Team	ArXiv
2026-06-03	Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement `Dexterous` `VLA` Large vision-language-action (VLA) policies are increasingly trained as conditional generative models over action chunks. Yet deployment produces mixed-quality experience (successful demonstrations, partial completions, recoverable mistakes, and failures) that is difficult to use with standard imitation. Full behavior cloning (BC) imitates failures, filtered BC discards useful sub-trajectories, and…	Gang Wang Team	ArXiv
2026-06-03	HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning `Dexterous` `VLA` `Tactile` Despite the importance of tactile sensing for reliable manipulation, most existing Vision-Language-Action (VLA) datasets remain vision-only, and those that do incorporate tactile information typically lack the joint combination of task diversity, language conditioning, and action trajectories. Furthermore, existing teleoperation pipelines rarely provide haptic feedback to the operator, despite…	Shan Luo Team	ArXiv
2026-06-03	VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training `Dexterous` `VLA` Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are…	Xuelong Li Team	ArXiv
2026-06-03	3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training `Dexterous` `VLA` We propose a 3D-thinking-guided co-training framework that enables vision-language-action (VLA) models to perform 3D spatial reasoning implicitly during action prediction. Our core insight is that 3D geometry perception and 3D spatial reasoning are distinct capabilities that can be disentangled and injected at different feature hierarchies. During training, three tightly coupled components work…	Weihao Yuan Team	ArXiv
2026-06-03	FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization `VLA` Post-training Vision-Language-Action (VLA) models into policies that can be reliably deployed on real robots remains a major bottleneck. SFT and DAgger exploit failure signals only indirectly, and reward-based RL is bottlenecked by the difficulty of real-world reward design and of training reliable critics. We present FlowPRO, a reward-free offline reinforced fine-tuning framework for…	Zhengyou Zhang Team	ArXiv
2026-06-03	Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety `VLA` Safety standards for ML-based autonomous driving specify the kind of evidence an assurance case must contain (directed cause-and-effect chains, quantified interventional effects, named root-cause variables), yet the XAI literature is organised by output type and technique family (saliency maps, feature attribution, counterfactuals, causal graphs, language traces). SHAP, the most-recommended ADS…	Maria Spence Team	ArXiv
2026-06-03	Brick-Composer: Using MLLMs for Assembly with Diverse Bricks `VLA` We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves…	Heng Ji Team	ArXiv
2026-06-03	BIDENT: Heterogeneous Operator-level Mapping for Efficient Edge Inference `VLA` Modern edge System-on-Chips (SoCs) integrate heterogeneous processing units (PUs) such as CPUs, GPUs, and NPUs, yet current inference stacks map entire models to a single PU, leaving significant performance and energy efficiency on the table. This is exacerbated by emerging architectures such as state-space models (SSMs), Kolmogorov-Arnold networks (KANs), and multi-stage vision-language-action…	Vijay Raghunathan Team	ArXiv
2026-06-02	VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring `VLA` As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount – physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted….	Changliu Liu Team	ArXiv
2026-06-02	CLI-Anything: Towards Agent-Native Computer Use `VLA` As large language models advance in reasoning and tool use capabilities, researchers increasingly seek to leverage them for computer use agents that can interact with existing software. The dominant approach develops GUI agents that control applications through visual interfaces: interpreting screenshots, locating UI elements, and executing mouse clicks to mimic human interaction. This…	Chao Huang Team	ArXiv
2026-06-02	Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation `VLA` `Dexterous` `Manipulation` Embodied chain-of-thought (CoT) aims to bridge linguistic reasoning and robotic control, but its effective form and integration strategy remain underexplored. In this paper, we revisit embodied CoT for vision-language-action (VLA) models at large scale. We construct the largest embodied CoT corpus to date, comprising 978,743 trajectories, 226.3M samples, and 2592.5 hours of robot data. Through…	Huaping Liu Team	ArXiv
2026-06-02	Same Weights, Different Robot: A Deployment Safety View of VLA Policies `VLA` Vision-language-action (VLA) policies are often treated as checkpoint-defined objects: if the weights, prompt, and benchmark suite match, the deployment is assumed to be the same policy. Robot execution breaks this assumption because the same normalized model output can become a different physical action after action unnormalization and controller conventions are applied. This creates a…	Jianwei Tai	ArXiv
2026-06-02	PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models `VLA` `Dexterous` `Manipulation` Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forgetting of previously learned behaviors. While experience replay (ER) serves as a standard mitigating strategy, naive…	Yandong Guo Team	ArXiv
2026-06-02	Partially Observable Adversarial Patch Attacks on Vision-Language-Action Models in Robotics `VLA` Vision-language-action (VLA) models are gaining attention in robotics, yet their robustness to adversarial attacks remains largely unexplored. Existing work shows that adversarial patches can mislead VLA-based robots but assumes full access to the entire execution trajectory, an unrealistic requirement in practice. We address this limitation by formulating a partially observable threat model,…	Keke Tang Team	ArXiv
2026-06-02	OpenEAI-Platform: An Open-source Embodied Artificial Intelligence Hardware-Software Unified Platform `VLA` `Dexterous` Embodied AI in the real world requires both accurate hardware and robust vision-language-action (VLA) policies. We present OpenEAI-Platform, a fully open-source platform that integrates a low-cost 6+1 degree-of-freedom (dof) robotic arm (OpenEAI-Arm) and a reproducible VLA model (OpenEAI-VLA). OpenEAI-Arm provides open-source mechanical designs for low manufacturing cost and compliant control…	Nanyang Ye Team	ArXiv
2026-06-02	Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation `VLA` `Dexterous` `Manipulation` In robotic manipulation, the tight coupling between grasping and motion planning often obscures the true source of failure, leading to inefficient trial-and-error. To enable efficient long-horizon manipulation, we propose GTP-FA (Grasp-Then-Plan with Failure Attribution), a task-oriented two-stage grasp-then-plan framework that generates grasp candidates and performs downstream motion planning…	Wanyuan Wang Team	ArXiv / Web
2026-06-02	GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models `VLA` Current Vision–Language–Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived…	Yue Gao Team	ArXiv
2026-06-02	NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation `VLA` As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural…	Zian Wang Team	ArXiv
2026-06-02	TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models `VLA` Vision-Language-Action (VLA) models trained on large-scale data have made remarkable progress, but they remain vulnerable to distribution shifts at deployment time. Recent VLA models suggest that prompts can serve as an efficient interface for steering policy behavior, but existing prompt-based steering typically relies on external guidance. This raises a natural question: can test-time training…	Xiao Ma Team	ArXiv
2026-06-02	Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation `VLA` In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the…	Kun Zhou Team	ArXiv
2026-06-01	Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation `Dexterous` `VLA` Vision-Language-Action (VLA) models generalize across static manipulation but fail when objects move during task execution. They map the current observation to an action and assume the scene is stationary between observation and execution, so at any non-trivial object speed the resulting latency exceeds the time available to grasp. We close this gap with AHEAD (Anticipatory Horizon Extrapolation…	Jeffrey Ichnowski Team	ArXiv
2026-06-01	RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models `Dexterous` `VLA` Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an…	Kai Chen Team	ArXiv / Web
2026-06-01	Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning `Dexterous` `VLA` End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show the promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) the reliance on 2D RGB inputs that ignores the intrinsically 3D nature of manipulation; and 2) the lack of spatial 3D alignment between input-output spaces…	Kui Jia Team	ArXiv
2026-06-01	Towards Precise Intent-Aligned VLA Aerial Navigation via Expert-Guided GRPO `VLA` Vision-Language-Action (VLA) models offer a promising end-to-end paradigm for unmanned aerial vehicles (UAVs) to accomplish complex tasks specified by fine-grained instructions. However, standard supervised fine-tuning (SFT) suffers from data scarcity, limited generalization, and weak supervision for nuanced and complicated human intents. Reinforcement fine-tuning offers a natural way to mitigate…	Fei Gao Team	ArXiv
2026-06-01	FATE-VLA: Failure-aware test generation for vision-language-action models `VLA` Vision-Language-Action (VLA) models are increasingly used as generalist robot policies, yet their evaluation still relies largely on static benchmarks that randomly sample task scenes. In high-dimensional embodied spaces, failures are sparse and clustered, so static benchmarking can underestimate robustness risks. We reframe VLA evaluation as an active failure-discovery problem and propose a…	Aitor Arrieta Team	ArXiv
2026-06-01	WALL-WM: Carving World Action Modeling at the Event Joints `VLA` `HF-Hot` WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and…	Qian Wang Team	ArXiv
2026-06-01	Co-training with Ego-centric Video and Demonstration for Robot Navigation Task `VLA` Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has explored augmenting manipulation datasets with egocentric human videos, applying such approaches to mobile robot navigation remains challenging due to…	Kanata Suzuki Team	ArXiv
2026-06-01	The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space `VLA` Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic…	Chun-Yi Lee Team	ArXiv
2026-06-01	TimeLogic Challenge @ CVPR 2026: Strong MLLMs Meet Evidence-Seeking Agents for Temporal-Logic Video Question Answering `VLA` Temporal-logic video question answering requires a model to reason about when actions occur relative to one another, such as before, after, until, since, overlap, and multi-event chains, rather than merely what is present in a video. Standard vision-language models typically answer such questions in a single pass over a fixed, uniformly sampled set of frames, which is poorly matched to evidence…	Jianlong Wu Team	ArXiv
2026-06-01	Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation `VLA` Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates…	Wenshan Wang Team	ArXiv
2026-06-01	Cosmos 3: Omnimodal World Models for Physical AI `VLA` We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI – effectively subsuming vision-language models, video generators,…	Artur Zolkowski Team	ArXiv
2026-06-01	AURA: Action-Gated Memory for Robot Policies at Constant VRAM `VLA` The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather…	Josef Chen	ArXiv
2026-06-01	SeeTraceAct: Visibility-Aware Latent Planning from Cross-Embodiment Demonstration Videos `VLA` Vision-language-action models (VLAs) are promising general-purpose robot policies, but adapting them to new tasks typically requires costly task-specific teleoperation data. As an alternative, we study one-shot demo-conditioned VLAs, where a robot policy is conditioned on a single demonstration video of an unseen task. We find that existing end-to-end approaches often struggle when successful…	Zsolt Kira Team	ArXiv
2026-06-01	See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs `VLA` Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the…	Kei Ota Team	ArXiv / Web
2026-05-31	Make Your VLA More Robust Without More Data By Interleaving Motion Planning `Manipulation` `VLA` Vision-Language-Action (VLA) models have shown remarkable progress for mobile manipulation, but their performance on long-horizon tasks remains poor. These tasks are especially challenging because (1) progress toward high-level goals must be maintained across extended sequences of spatially distributed subtasks, and (2) early execution errors compound rapidly over the task horizon. These…	Shreyas Kousik Team	ArXiv
2026-05-31	Threading Optimization for Vision-Language-Action Model Inference in Low-Cost Smart Agricultural Manipulation `Manipulation` `VLA` Vision-Language Action (VLA) models continue to face challenges such as slow inference speed and difficulty performing fine-grained motion adjustments, limiting their widespread adoption in industry. While the Real-Time Action Chunking (RTAC) algorithm has been proposed to address these bottlenecks, bridging the gap between the algorithm provided in pseudocode to a stable, real-world deployment…	Lifeng Zhou Team	ArXiv
2026-05-31	LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World `VLA` `LearnedControl` Training vision-language-action (VLA) policies for humanoid loco-manipulation is constrained by the high cost and complexity of collecting human teleoperation demonstrations. VLA policies fine-tuned in simulators have, until now, failed to transfer effectively in humanoid loco-manipulation tasks. We present LEGS (Loco-manipulation via Embodied Gaussian Splatting), a hybrid simulator that…	Mac Schwager Team	ArXiv / Web
2026-05-31	OneVLA: A Unified Framework for Embodied Tasks `VLA` Navigation and manipulation are fundamental capabilities of embodied intelligence, enabling robots to interpret natural language commands and interact physically with their surroundings. However, current Vision-Language-Action (VLA) models remain constrained by task-specific architectures, specializing in either navigation or manipulation, which hinders the development of general-purpose robotic…	Wenbo Ding Team	ArXiv
2026-05-31	ImagineUAV: Aerial Vision-Language Navigation via World-Action Modeling and Kinodynamic Planning `VLA` Vision-language navigation (VLN) for UAVs demands grounding free-form instructions into 6-DoF flight under partial observability. While Vision-Language-Action (VLA) models excel at semantic reasoning, they suffer from brittleness due to geometric inconsistency and dynamics mismatch. To address this, we propose ImagineUAV, an imagination-driven framework leveraging cascaded world-action modeling….	Jiankun Yang Team	ArXiv / Web
2026-05-31	Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies `VLA` Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance, but updating large action decoders is frequently unstable and sample inefficient. We propose Lagrangian Perturbation Diffusion Steering (LP-DS), a lightweight…	Ozgur S. Oguz Team	ArXiv
2026-05-31	Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA `VLA` Vision-language-action (VLA) policies and World-Action Models (WAM) represent two increasingly important paradigms for robotic manipulation. However, it remains unclear whether future prediction in WAMs leads to behaviorally meaningful improvements beyond final task success. In this paper, we ask whether WAMs merely add future prediction, or whether they change robot behavior and internal…	Tuan Do Team	ArXiv
2026-05-30	PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation `Manipulation` `VLA` Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a…	Tianrui Li Team	ArXiv
2026-05-30	Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated `VLA` Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs…	Rashid Mushkani	ArXiv
2026-05-30	SafeVLA-Bench: A Benchmark for the Success-Safety Gap in Vision-Language-Action Models `VLA` Vision-language-action (VLA) benchmarks measure whether a policy completes a requested manipulation task, but binary success can hide safety-relevant trajectory behavior: reaching the goal while applying excessive contact, disturbing bystander objects, destabilizing the held object, or entering robot self-contact. We present SafeVLA-Bench, a post-hoc safety-evaluation framework for existing…	Fanxin Kong Team	ArXiv
2026-05-30	Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion `VLA` Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward…	Emad Barsoum Team	ArXiv