Venue
International Conference on Machine Learning
Domain
Artificial Intelligence, Natural Language Processing, Computer Vision
The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents great potential for web agents: it can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding remains a major challenge. Existing LMM grounding strategies such as set-of-mark prompting turn out to be ineffective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML structure and visuals. Yet, there is still a substantial gap with oracle grounding, leaving ample room for further improvement. All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct.
The paper presents SEEACT, a generalist web agent that leverages GPT-4V(ision) to perform web tasks by integrating visual understanding and action execution. It highlights a significant advancement in large multimodal models (LMMs) like GPT-4V and Gemini, showcasing their potential to navigate and interact with varied websites based on natural language instructions. The authors evaluate SEEACT on the MIND2WEB dataset and describe an online evaluation tool that assesses web-agent performance on live websites, in contrast to traditional evaluation on cached websites. The results demonstrate SEEACT's effectiveness: it achieves a 51.1% task completion rate with oracle grounding, significantly surpassing text-only models. The study discusses the challenge of grounding actions to HTML elements and compares multiple grounding strategies. It concludes by emphasizing the need for better grounding methods and noting the discrepancies observed between online and offline evaluations.
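The two-stage behavior described above (generate a textual action plan from the page, then ground it to a concrete element) can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the `lmm` call is a stub standing in for a multimodal model such as GPT-4V, and the text-overlap matching in `agent_step` is a deliberately naive grounding heuristic.

```python
# Sketch of a single step of a two-stage LMM web agent:
# (1) generation: ask the model to describe the next action in text;
# (2) grounding: map that description to a candidate page element.
def lmm(prompt, screenshot=None):
    # Stub for a multimodal model call (assumption, not a real API).
    return "CLICK the 'Search' button"

def agent_step(task, screenshot, candidates):
    """candidates: list of (element_id, visible_text) pairs from the page."""
    plan = lmm(f"Task: {task}\nDescribe the next action.", screenshot)
    # Naive grounding: pick the first candidate whose text appears in the plan.
    for elem_id, text in candidates:
        if text.lower() in plan.lower():
            return plan, elem_id
    return plan, None  # grounding failed; no matching element
```

The gap between the generated plan and a correct element choice is exactly the "grounding challenge" the paper emphasizes.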
This paper employs the following methods:
- SEEACT
- Grounding via Element Attributes
- Grounding via Textual Choices
- Grounding via Image Annotation
- GPT-4V
- FLAN-T5
- BLIP-2
- LLaVA-1.5
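Of the grounding methods listed above, grounding via textual choices can be illustrated with a small sketch: candidate HTML elements are rendered as a lettered multiple-choice question for the model. The function name and prompt format here are hypothetical, not the paper's exact prompt.

```python
def build_choice_prompt(action_desc, candidates):
    """Format candidate elements (tag, visible text) as lettered
    multiple-choice options, with a 'None of the above' escape option."""
    lines = [f"Action to ground: {action_desc}", "Choose the target element:"]
    for i, (tag, text) in enumerate(candidates):
        letter = chr(ord("A") + i)
        lines.append(f"{letter}. <{tag}> {text}")
    # Final option lets the model decline when no candidate matches.
    lines.append(f"{chr(ord('A') + len(candidates))}. None of the above")
    return "\n".join(lines)
```

The model's single-letter answer is then mapped back to the chosen element, which sidesteps asking the LMM to emit raw coordinates or HTML.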
The following datasets were used in this research:
- MIND2WEB
The evaluation relies on the following metrics:
- Element Accuracy
- Operation F1
- Step Success Rate
- Success Rate
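The metrics above could be computed roughly as sketched below. These definitions follow my reading of the MIND2WEB conventions and are simplified: operations are compared by exact match here, whereas Operation F1 is properly a token-level F1, so it is omitted from this sketch.

```python
def step_success(pred_elem, gold_elem, pred_op, gold_op):
    # A step succeeds when both the selected element and the operation match.
    return pred_elem == gold_elem and pred_op == gold_op

def evaluate(tasks):
    """tasks: list of tasks, each a list of
    (pred_elem, gold_elem, pred_op, gold_op) steps."""
    steps = [s for t in tasks for s in t]
    elem_acc = sum(p == g for p, g, _, _ in steps) / len(steps)
    step_sr = sum(step_success(*s) for s in steps) / len(steps)
    # Task-level success requires every step in the task to succeed.
    task_sr = sum(all(step_success(*s) for s in t) for t in tasks) / len(tasks)
    return elem_acc, step_sr, task_sr
```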
The paper reports the following results:
- SEEACT with GPT-4V achieved a 51.1% success rate on live websites with oracle grounding
- Best grounding strategy leveraging both HTML and visuals showed substantial improvement
- GPT-4V outperformed text-only models in web interaction tasks
The authors identified the following limitations:
- Grounding remains a major challenge; existing LMM grounding strategies such as set-of-mark prompting are not effective for web agents
- A substantial gap remains between the best proposed grounding strategy and oracle grounding
- Discrepancies were observed between online and offline evaluations
Hardware details:
- Number of GPUs: None specified
- GPU Type: None specified
Keywords
Large Multimodal Models
Web Agents
Grounding Strategies
Visual Understanding