Insight

Research

Papers and technical reports from the RLWRLD team.

Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps

Cog3DMap constructs an explicit 3D cognitive map from multi-view images, grounding each token in 3D space with both semantic and geometric information, enabling MLLMs to directly reason over spatially structured representations for SOTA spa…

arXiv Apr 29, 2026
RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

RoboAlign uses RL-based reasoning alignment to bridge the language-action modality gap in VLAs, achieving up to 106.6% improvement over SFT baselines on real-world robotics tasks with less than 1% additional data.

arXiv Apr 29, 2026
SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

SpatialBoost enhances visual representations using 3D information expressed in linguistic form.

arXiv Apr 29, 2026
RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning

Neural trajectory curation framework that diversifies synthetic robot data via controllable video generation (I2I/V2V) and filters low-quality samples by comparing motion consistency between generated video and simulator replay

arXiv Apr 29, 2026
Affostruction: 3D Affordance Grounding with Generative Reconstruction

Generative reconstruction to complete occluded regions and ground affordances on full 3D shapes

CVPR 2026 Apr 29, 2026
Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

CompACT compresses each observation into just 8 discrete tokens, enabling orders-of-magnitude faster planning in latent world models

CVPR 2026 Apr 29, 2026
MoGaF: Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping

Long-term stable scene forecasting via motion-aware Gaussian grouping

CVPR 2026 Apr 29, 2026
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Post-training T2I generators with the model's own self-confidence as reward, improving compositionality and text-image alignment without external reward models

CVPR 2026 Apr 29, 2026
Dexterous World Models

Proposes DWM (Dexterous World Model), a video diffusion framework that generates temporally coherent human-scene interaction videos by conditioning on static 3D scene renderings and egocentric hand mesh sequences.

CVPR 2026 Apr 29, 2026
Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons

Arena-Lite introduces a tournament-based direct comparison approach for LLM evaluation that eliminates the need for baseline outputs. By using head-to-head comparisons between systems, it achieves higher reliability with fewer comparisons t…

EMNLP 2025 Apr 29, 2026
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

Proposes DUST (DUal-STream diffusion), a dual-stream architecture that separately processes vision and action modalities with asynchronous sampling for world modeling.

arXiv Apr 29, 2026
Verifier-Free Test-Time Sampling for Vision Language Action Models

Introduces MG-Select, a training-free test-time scaling method that uses KL divergence from a masked reference distribution as a confidence metric for action selection.

ICLR 2026 Apr 29, 2026
ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context

Proposes ContextVLA, which compresses multi-frame temporal context into a single token for efficient processing without computational overhead.

arXiv Apr 29, 2026
Contrastive Representation Regularization for Vision-Language-Action Models

Introduces RS-CL (Robot State-aware Contrastive Loss) that aligns VLA representations with proprioceptive states using relative distances as soft supervision.

arXiv Apr 29, 2026
HAMLET: Switch Your Vision-Language-Action Model into a History-Aware Policy

ICLR 2026 Apr 29, 2026
ALLEX: Where RLWRLD’s Potential Unfolds

In collaboration with RLWRLD, WIRobotics has unveiled ALLEX, a humanoid robot engineered for safe human–robot collaboration and remarkable hand dexterity. The true intelligence from RLWRLD on ALLEX will be unveiled this fall. To meet ALLEX'…

Apr 29, 2026
Combinative Matching for Geometric Shape Assembly

The proposed approach significantly reduces local ambiguities in matching by explicitly modeling both identical surface shapes and opposite volume occupancy, enabling more accurate correspondences and ultimately allowing a robust combinatio…

Apr 29, 2026
A Unified Framework for Motion Reasoning and Generation in Human Interaction

To address these challenges, we introduce VIM, the Versatile Interactive Motion-language model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversa…

ICCV 2025 Apr 29, 2026
Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale

Affordance grounding, the task of localizing object regions based on natural language descriptions of interactions, is a critical challenge for enabling intelligent agents to understand and interact with their environments.

Apr 29, 2026
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations.

NeurIPS 2025 Apr 29, 2026
Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes

We present a new model, coined MMP, to estimate the geometry in a feed-forward manner, which produces a dynamic pointmap representation that evolves over multiple frames. Specifically, based on the recent Siamese architecture, we introduce …

Apr 29, 2026
Target-Aware Video Diffusion Models

Given an input image, our target-aware video diffusion model generates a video in which an actor accurately interacts with the target, specified by its segmentation mask.

ICLR 2026 Apr 29, 2026
Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

Given a textual description of the spatial relationship between two objects, our method models OOR, representing their relative poses and scales according to the text. We obtain OOR samples using off-the-shelf models and a proposed mesh reg…

ICCV 2025 Apr 29, 2026
