Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. Current language-model-based navigation systems operate on discrete topological graphs, limiting path planning to predefined node connections. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLMs) to directly translate egocentric video streams into continuous navigation actions, adopting GRPO-based training inspired by DeepSeek-R1. To enable effective training, we first construct the VLN-Ego dataset using a 3D simulator, Habitat, and propose Long-Short Memory Sampling to balance historical and current observations. While large language models can supervise complete textual instructions, they lack fine-grained action-level control. Our framework therefore employs a two-stage training approach: (a) supervised fine-tuning (SFT) to align the model's action-sequence text predictions with expert demonstrations, followed by (b) reinforcement fine-tuning (RFT) enhanced with a Time-Decayed Reward (TDR) mechanism that strategically weights multi-step future actions. Experimental results show that VLN-R1 achieves strong performance on the VLN-CE benchmark. VLN-R1 proves LVLMs can drive embodied navigation and enhance task-specific reasoning through data-efficient, reward-driven post-training.
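The abstract refers to Long-Short Memory Sampling only at a high level. Below is a minimal sketch of one way such a sampler could balance historical and current observations, assuming a dense window of recent frames plus a uniformly subsampled slice of the older history; the function name, frame counts, and uniform-stride choice are illustrative assumptions rather than details taken from the paper.

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")  # e.g. one egocentric RGB observation

def long_short_memory_sample(
    frames: List[Frame],
    num_short: int = 8,  # assumed size of the dense "short" (recent) window
    num_long: int = 4,   # assumed number of sparse "long" (historical) frames
) -> List[Frame]:
    """Select a fixed-size mix of historical and current observations.

    The most recent `num_short` frames are kept densely, while `num_long`
    frames are subsampled uniformly from everything older, so the LVLM sees
    both the current context and a compressed history.
    """
    if len(frames) <= num_short + num_long:
        return list(frames)
    short_memory = frames[-num_short:]
    history = frames[:-num_short]
    stride = max(1, len(history) // num_long)
    long_memory = history[::stride][:num_long]
    return long_memory + short_memory
```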
The paper presents VLN-R1, a framework for Vision-Language Navigation (VLN) that uses Large Vision-Language Models (LVLMs) to enable real-time navigation in continuous 3D environments from natural language instructions. Its main contributions include the VLN-Ego dataset, which pairs egocentric video streams with action predictions. The framework employs a two-stage training methodology: Supervised Fine-Tuning (SFT) to align action predictions with expert demonstrations, followed by Reinforcement Fine-Tuning (RFT) enhanced with a Time-Decayed Reward (TDR) mechanism to improve long-horizon navigation. VLN-R1 overcomes the limitations of previous graph-based methods by eliminating reliance on predefined navigation paths and demonstrates strong performance on benchmarks such as VLN-CE. Key mechanisms include a Long-Short Memory Sampling strategy that balances historical and real-time observations, and reward-driven RFT for optimizing navigation behavior. The results indicate state-of-the-art performance on VLN tasks, showcasing the potential of LVLMs for embodied AI applications.
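The Time-Decayed Reward (TDR) mechanism is described here only as strategically weighting multi-step future actions. The sketch below shows one plausible form of such a reward, assuming a per-step match against expert actions and an exponential decay factor; the decay value, matching rule, and normalization are assumptions for illustration, not the paper's exact formulation.

```python
from typing import Sequence

def time_decayed_reward(
    predicted: Sequence[int],
    expert: Sequence[int],
    decay: float = 0.8,  # assumed decay factor; not specified in the summary
) -> float:
    """Reward a multi-step action prediction with time-decayed weights.

    Step t contributes decay**t when the predicted action matches the
    expert action, so near-term actions dominate while later steps still
    provide signal for long-horizon behavior. Normalized to [0, 1].
    """
    horizon = min(len(predicted), len(expert))
    if horizon == 0:
        return 0.0
    reward = sum((decay ** t) * float(predicted[t] == expert[t]) for t in range(horizon))
    max_reward = sum(decay ** t for t in range(horizon))
    return reward / max_reward
```

In RFT, a scalar reward of this kind would be computed for each sampled action sequence and fed into the GRPO objective.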
This paper employs the following methods:
- Reinforcement Learning
- Supervised Fine-Tuning
- Reinforcement Fine-Tuning
- Group Relative Policy Optimization (GRPO); see the advantage sketch after this list
- Long-Short Memory Sampling
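Since GRPO is central to the RFT stage, here is a minimal sketch of its group-relative advantage computation, which standardizes each sampled completion's reward against its group's statistics instead of using a learned value critic; the group size and reward values in the usage example are illustrative.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Compute GRPO-style advantages for one group of sampled completions.

    For G action-sequence completions sampled from the same navigation
    prompt, each advantage is the completion's reward standardized by the
    group mean and standard deviation; these advantages then weight the
    clipped policy-gradient update.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards of 4 rollouts sampled for one instruction (illustrative values).
advantages = group_relative_advantages([0.9, 0.4, 0.7, 0.1])
```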
The following datasets were used in this research:
- VLN-Ego (constructed by the authors in the Habitat simulator)
- VLN-CE (benchmark used for evaluation)
The following evaluation metrics were reported:
- Success Rate (SR)
- Oracle Success Rate (OS)
- Success weighted by Path Length (SPL); see the sketch after this list
- Navigation Error (NE)
- Trajectory Length (TL)
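For reference, the sketch below computes Success weighted by Path Length using its standard definition, the mean over episodes of S_i * l_i / max(p_i, l_i), where l_i is the shortest-path length and p_i the agent's path length; this is the common VLN-CE convention, not anything specific to this paper.

```python
from typing import Sequence

def spl(
    successes: Sequence[bool],          # S_i: episode ended within the success radius
    shortest_lengths: Sequence[float],  # l_i: geodesic start-to-goal distance
    path_lengths: Sequence[float],      # p_i: length of the path the agent took
) -> float:
    """Success weighted by Path Length, averaged over episodes."""
    terms = [
        float(s) * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    ]
    return sum(terms) / len(terms)
```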
The paper reports the following results:
- VLN-R1 achieves state-of-the-art performance on the VLN-CE benchmark
- Effective cross-domain adaptation with minimal data
- Demonstrated effectiveness of RFT in enhancing smaller models
The authors identified the following limitations:
- Evaluation limited to simulated indoor environments
- Restricted fine-grained control due to discrete action space
The compute resources were as follows:
- Number of GPUs: 8
- GPU Type: NVIDIA A800
- Training details: SFT uses a global batch size of 64 (per-GPU batch size of 2) and completes 1 epoch in 36 hours; RFT uses a per-GPU batch size of 1 and takes about 12 hours per epoch.