Research

Papers and technical reports from the RLWRLD team.

Robot-Factored World Models via Robot Rendering

Action-conditioned video world models predict future observations from an initial observation and an action signal. In robotics, actions influence future observations through two distinct processes: they are first realized into robot motion…

arxiv Jul 30, 2026
Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-…

arxiv Jun 9, 2026
Efficient Timing-Aware Planning via Detachable Task Modeling

Proposes a Timing-Aware Detachable-task planning (TAD) approach that can be integrated into existing task planners without modifying the original planner's structure.

IROS 2026 Jun 30, 2026
Natural Functional Gradients for Smooth Trajectory Optimization

A trajectory optimization framework that performs geometry-aware updates directly in function space via natural functional gradients, producing collision-free, smooth motions in constrained environments.

RSS 2026 May 27, 2026
TRQAM: Trust Region Q-Adjoint Matching — Stable Off-Policy RL for Flow Policies

TRQAM internalizes a trust-region parameter λ into the sampling dynamics of pretrained flow policies, controlling path-space KL in closed form via projected dual descent for stable off-policy RL. Fine-tunes RLDX-1 to a new SOTA on the GR1 T…

arxiv May 26, 2026
MoSS: Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

MoSS augments VLAs to leverage multiple heterogeneous physical signals (tactile, torque) for action prediction, via decoupled modality streams fused into the action stream through joint cross-modal self-attention.

arxiv Apr 25, 2026
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

DeVI leverages text-conditioned synthetic videos as imitation targets, combining 3D human tracking with robust 2D object tracking into a hybrid reward, to learn physically plausible dexterous manipulation that generalizes zero-shot to unsee…

arxiv Apr 22, 2026
MOSS: Exploring High-Order Self-Similarity for Video Understanding

MOSS is a lightweight module that learns and integrates multi-order space-time self-similarity (STSS) features, boosting temporal/motion modeling across diverse video tasks at marginal compute cost.

arxiv Apr 22, 2026
HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Grasps

A large-scale, multi-modal dataset of high-fidelity dexterous grasping sequences featuring both human and robotic hands, enabling direct human-robot grasp comparison.

arxiv Apr 16, 2026
VLMPose: Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

Without any additional training, a VLM agent iteratively refines an object's 6D pose to follow text instructions in a closed loop.

arxiv Apr 10, 2026
Heavy Lifting: How Much Heavy Lifting Can an Agent Harness Do? — Measuring the LLM's Residual Role in a Planning Agent

Declarative planning (zero LLM calls) carries the "heavy lifting" of a planning agent (+24.1pp win rate over a belief-only harness) while the LLM-backed revision gate activates on only 4.3% of turns.

arxiv Apr 8, 2026
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

arxiv Apr 6, 2026
GRIT: Learning Dexterous Grasping from Sparse Taxonomy Guidance

GRIT enables dexterous robotic grasping through high-level taxonomy commands that guide low-level continuous control, reaching an 87.9% success rate with strong generalization to novel objects.

arxiv Apr 5, 2026
VaLR: Vision-aligned Latent Reasoning for Multi-modal Large Language Model

VaLR dynamically generates vision-aligned latent tokens before each Chain-of-Thought step, preserving visual information during long reasoning and unlocking test-time scaling.

ICML 2026 Apr 1, 2026
Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps

Cog3DMap constructs an explicit 3D cognitive map from multi-view images, grounding each token in 3D space with both semantic and geometric information, enabling MLLMs to directly reason over spatially structured representations for SOTA spa…

arxiv Mar 31, 2026
RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

RoboAlign uses RL-based reasoning alignment to bridge the language-action modality gap in VLAs, achieving up to 106.6% improvement over SFT baselines on real-world robotics tasks with less than 1% additional data.

arxiv Mar 31, 2026
SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

SpatialBoost enhances visual representations by injecting 3D information expressed in linguistic form.

ECCV 2026 Mar 31, 2026
RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning

Neural trajectory curation framework that diversifies synthetic robot data via controllable video generation (I2I/V2V) and filters low-quality samples by comparing motion consistency between generated video and simulator replay

arxiv Feb 25, 2026
Affostruction: 3D Affordance Grounding with Generative Reconstruction

Generative reconstruction to complete occluded regions and ground affordances on full 3D shapes

CVPR 2026 Feb 25, 2026
Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

CompACT compresses each observation into just 8 discrete tokens, enabling orders-of-magnitude faster planning in latent world models

CVPR 2026 Feb 25, 2026
MoGaF: Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping

Long-term stable scene forecasting via motion-aware Gaussian grouping

CVPR 2026 Feb 25, 2026
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Post-training T2I generators with the model's own self-confidence as reward, improving compositionality and text-image alignment without external reward models

CVPR 2026 Feb 25, 2026
Dexterous World Models

Proposes DWM (Dexterous World Model), a video diffusion framework that generates temporally coherent human-scene interaction videos by conditioning on static 3D scene renderings and egocentric hand mesh sequences.

CVPR 2026 Dec 19, 2025
Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons

Arena-Lite introduces a tournament-based direct comparison approach for LLM evaluation that eliminates the need for baseline outputs. By using head-to-head comparisons between systems, it achieves higher reliability with fewer comparisons t…

EMNLP 2025 Nov 2, 2025
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

Proposes DUST (DUal-STream diffusion), a dual-stream architecture that separately processes vision and action modalities with asynchronous sampling for world modeling.

ICML 2026 Oct 31, 2025
Verifier-Free Test-Time Sampling for Vision Language Action Models

Introduces MG-Select, a training-free test-time scaling method that uses the KL divergence from a masked reference distribution as a confidence metric for action selection.

ICLR 2026 Oct 7, 2025
ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context

Proposes ContextVLA, which compresses multi-frame temporal context into a single token for efficient processing without computational overhead.

arxiv Oct 5, 2025
Contrastive Representation Regularization for Vision-Language-Action Models

Introduces RS-CL (Robot State-aware Contrastive Loss) that aligns VLA representations with proprioceptive states using relative distances as soft supervision.

ICML 2026 Oct 2, 2025
HAMLET: Switch Your Vision-Language-Action Model into a History-Aware Policy

ICLR 2026 Oct 1, 2025
ALLEX: Where RLWRLD’s Potential Unfolds

In collaboration with RLWRLD, WIRobotics has unveiled ALLEX, a humanoid robot engineered for safe human–robot collaboration and remarkable hand dexterity. The true intelligence from RLWRLD on ALLEX will be unveiled this fall. To meet ALLEX'…

Aug 18, 2025
Combinative Matching for Geometric Shape Assembly

The proposed approach significantly reduces local ambiguities in matching by explicitly modeling both identical surface shapes and opposite volume occupancy, enabling more accurate correspondences and ultimately allowing a robust combinatio…

Aug 13, 2025
A Unified Framework for Motion Reasoning and Generation in Human Interaction

We introduce VIM, the Versatile Interactive Motion-language model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversa…

ICCV 2025 Jun 26, 2025
Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale

Affordance grounding, the task of localizing object regions based on natural language descriptions of interactions, is a critical challenge for enabling intelligent agents to understand and interact with their environments.

ECCV 2026 Jun 13, 2025
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations.

NeurIPS 2025 May 29, 2025
Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes

We present a new model, coined MMP, to estimate the geometry in a feed-forward manner, which produces a dynamic pointmap representation that evolves over multiple frames. Specifically, based on the recent Siamese architecture, we introduce …

May 3, 2025
Target-Aware Video Diffusion Models

Given an input image, our target-aware video diffusion model generates a video in which an actor accurately interacts with the target, specified with its segmentation mask.

ICLR 2026 Apr 2, 2025
Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

Given a textual description of the spatial relationship between two objects, our method models their object-object spatial relationship (OOR), representing their relative poses and scales according to the text. We obtain OOR samples using off-the-shelf models and a proposed mesh reg…

ICCV 2025 Mar 25, 2025
