Venue
International Conference on Machine Learning
Domain
Artificial Intelligence, Natural Language Processing, Computer Vision
The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents great potential for web agents: it can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding remains a major challenge. Existing LMM grounding strategies such as set-of-mark prompting turn out to be ineffective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML structure and visuals. Yet, there is still a substantial gap with oracle grounding, leaving ample room for further improvement. All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct.
The paper presents SEEACT, a generalist web agent that leverages GPT-4V(ision) to perform web tasks by integrating visual understanding and action execution. It highlights a significant advancement in large multimodal models (LMMs) like GPT-4V and Gemini, showcasing their potential to navigate and interact with varied websites based on natural language instructions. The authors evaluate SEEACT on the MIND2WEB dataset and describe an online evaluation tool that assesses web-agent performance on live websites, in contrast to traditional evaluation on cached websites. The results demonstrate SEEACT's effectiveness: it achieves a 51.1% task completion rate with oracle grounding, significantly surpassing text-only models. The study discusses the challenge of grounding actions to HTML elements and compares multiple grounding strategies. It concludes by emphasizing the need for better grounding methods and noting the discrepancies observed between online and offline evaluations.
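The two-stage behavior described above (generate a textual action plan from the page, then ground it to a concrete element) can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the `lmm` call is a stub standing in for a multimodal model such as GPT-4V, and the text-overlap matching in `agent_step` is a deliberately naive grounding heuristic.

```python
# Sketch of a single step of a two-stage LMM web agent:
# (1) generation: ask the model to describe the next action in text;
# (2) grounding: map that description to a candidate page element.
def lmm(prompt, screenshot=None):
    # Stub for a multimodal model call (assumption, not a real API).
    return "CLICK the 'Search' button"

def agent_step(task, screenshot, candidates):
    """candidates: list of (element_id, visible_text) pairs from the page."""
    plan = lmm(f"Task: {task}\nDescribe the next action.", screenshot)
    # Naive grounding: pick the first candidate whose text appears in the plan.
    for elem_id, text in candidates:
        if text.lower() in plan.lower():
            return plan, elem_id
    return plan, None  # grounding failed; no matching element
```

The gap between the generated plan and a correct element choice is exactly the "grounding challenge" the paper emphasizes.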
This paper employs the following methods:
- SEEACT
- Grounding via Element Attributes
- Grounding via Textual Choices
- Grounding via Image Annotation
- GPT-4V
- FLAN-T5
- BLIP-2
- LLaVA-1.5
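Of the grounding methods listed above, grounding via textual choices can be illustrated with a small sketch: candidate HTML elements are rendered as a lettered multiple-choice question for the model. The function name and prompt format here are hypothetical, not the paper's exact prompt.

```python
def build_choice_prompt(action_desc, candidates):
    """Format candidate elements (tag, visible text) as lettered
    multiple-choice options, with a 'None of the above' escape option."""
    lines = [f"Action to ground: {action_desc}", "Choose the target element:"]
    for i, (tag, text) in enumerate(candidates):
        letter = chr(ord("A") + i)
        lines.append(f"{letter}. <{tag}> {text}")
    # Final option lets the model decline when no candidate matches.
    lines.append(f"{chr(ord('A') + len(candidates))}. None of the above")
    return "\n".join(lines)
```

The model's single-letter answer is then mapped back to the chosen element, which sidesteps asking the LMM to emit raw coordinates or HTML.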
The following datasets were used in this research:
- MIND2WEB
The evaluation relies on the following metrics:
- Element Accuracy
- Operation F1
- Step Success Rate
- Success Rate
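The metrics above could be computed roughly as sketched below. These definitions follow my reading of the MIND2WEB conventions and are simplified: operations are compared by exact match here, whereas Operation F1 is properly a token-level F1, so it is omitted from this sketch.

```python
def step_success(pred_elem, gold_elem, pred_op, gold_op):
    # A step succeeds when both the selected element and the operation match.
    return pred_elem == gold_elem and pred_op == gold_op

def evaluate(tasks):
    """tasks: list of tasks, each a list of
    (pred_elem, gold_elem, pred_op, gold_op) steps."""
    steps = [s for t in tasks for s in t]
    elem_acc = sum(p == g for p, g, _, _ in steps) / len(steps)
    step_sr = sum(step_success(*s) for s in steps) / len(steps)
    # Task-level success requires every step in the task to succeed.
    task_sr = sum(all(step_success(*s) for s in t) for t in tasks) / len(tasks)
    return elem_acc, step_sr, task_sr
```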
The paper reports the following results:
- SEEACT with GPT-4V achieved a 51.1% success rate on live websites with oracle grounding
- Best grounding strategy leveraging both HTML and visuals showed substantial improvement
- GPT-4V outperformed text-only models in web interaction tasks
The authors identified the following limitations:
- Grounding remains a major challenge; existing LMM grounding strategies such as set-of-mark prompting are not effective for web agents
- A substantial gap remains between the best proposed grounding strategy and oracle grounding
- Discrepancies were observed between online and offline evaluations
Hardware details:
- Number of GPUs: None specified
- GPU Type: None specified
Keywords
Large Multimodal Models
Web Agents
Grounding Strategies
Visual Understanding