Do We Need Large VLMs for Spotting Soccer Actions?

(2025)

Paper Information
arXiv ID: 2506.17144

Abstract

Traditional video-based tasks like soccer action spotting rely heavily on visual inputs, often requiring complex and computationally expensive models to process dense video data. In this work, we propose a shift from this video-centric approach to a text-based task, making it lightweight and scalable by utilizing Large Language Models (LLMs) instead of Vision-Language Models (VLMs). We posit that expert commentary, which provides rich, fine-grained descriptions and contextual cues such as excitement and tactical insights, contains enough information to reliably spot key actions in a match. To demonstrate this, we use the SoccerNet Echoes dataset, which provides timestamped commentary, and employ a system of three LLMs acting as judges specializing in outcome, excitement, and tactics. Each LLM evaluates sliding windows of commentary to identify actions like goals, cards, and substitutions, generating accurate timestamps for these events. Our experiments show that this language-centric approach performs effectively in detecting critical match events, providing a lightweight and training-free alternative to traditional video-based methods for action spotting.
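
The windowing step the abstract refers to is simple to sketch. Below is a minimal, assumed implementation in Python: the window and stride lengths are illustrative placeholders (the paper's exact values are not stated here), and `CommentLine` is a hypothetical container for one timestamped commentary line.

```python
from dataclasses import dataclass

@dataclass
class CommentLine:
    timestamp: float  # seconds from kickoff
    text: str

def sliding_windows(comments, window_s=30.0, stride_s=15.0):
    """Yield (window_start, lines) pairs of overlapping commentary windows.

    window_s and stride_s are illustrative values, not taken from the paper.
    """
    if not comments:
        return
    t, end = 0.0, max(c.timestamp for c in comments)
    while t <= end:
        lines = [c for c in comments if t <= c.timestamp < t + window_s]
        if lines:
            yield t, lines
        t += stride_s
```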

Summary

This paper proposes a novel approach to action spotting in soccer by leveraging expert commentary instead of traditional video inputs. The authors argue that large language models (LLMs) can effectively parse detailed commentary to identify key actions such as goals, cards, and substitutions. They use the SoccerNet Echoes dataset, which includes timestamped commentary, and a system of three specialized LLMs that judge the commentary on outcome, excitement, and tactical aspects. The study demonstrates that this language-centric method is a competitive alternative to video-based approaches, achieving notable improvements in action-detection metrics over existing methods. The paper questions whether visual input is necessary for action spotting given the richness of commentary data, and outlines future directions for research in this area.
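
To make the three-judge design concrete, here is a hedged sketch of how the voting could be wired up. Everything below is an assumption for illustration: `query_llm` is a placeholder for whatever backend serves Llama 3.1 8B, the prompts are invented, and the 2-of-3 majority rule and midpoint timestamp are plausible choices rather than details confirmed by the paper.

```python
ACTIONS = ["goal", "yellow card", "red card", "substitution"]
ASPECTS = ["outcome", "excitement", "tactics"]

def query_llm(prompt: str) -> str:
    """Placeholder for a call to the Llama 3.1 8B model used in the paper;
    wire this up to your own inference backend (e.g. vLLM or llama.cpp)."""
    raise NotImplementedError

def judge_window(transcript: str, aspect: str) -> set[str]:
    """One judge scores a commentary window for one aspect and returns the
    set of actions it believes occurred. The prompt wording is illustrative."""
    found = set()
    for action in ACTIONS:
        prompt = (
            f"You specialize in the {aspect} of soccer commentary.\n"
            f"Commentary window:\n{transcript}\n"
            f"Does a '{action}' occur here? Answer yes or no."
        )
        if query_llm(prompt).strip().lower().startswith("yes"):
            found.add(action)
    return found

def spot(window_start: float, transcript: str, window_s: float = 30.0):
    """Combine the three judges. The 2-of-3 majority rule is an assumption,
    as is using the window midpoint as the event timestamp."""
    votes = [judge_window(transcript, a) for a in ASPECTS]
    return [(window_start + window_s / 2, action)
            for action in ACTIONS
            if sum(action in v for v in votes) >= 2]
```

A full pipeline would render each window from the sliding-window helper above into a transcript string, call `spot` on it, and merge nearby detections of the same action.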

Methods

This paper employs the following methods:

  • Large Language Models (LLMs)
  • Three-LLM judge system (outcome, excitement, and tactics judges)
  • Sliding-window analysis of timestamped commentary

Models Used

  • Llama 3.1 8B

Datasets

The following datasets were used in this research:

  • SoccerNet Echoes

Evaluation Metrics

  • mean Average Precision (mAP)
  • tight average mAP
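
Both metrics count a predicted event as correct when it falls within a time tolerance of a ground-truth event of the same class; the reported "average mAP" averages AP over loose tolerances and "tight mAP" over strict ones (in the SoccerNet protocol these are typically 5–60 s and 1–5 s, though the exact ranges here are our assumption). The sketch below illustrates the idea for a single class, using greedy confidence-ordered matching and non-interpolated AP as simplifications of the official evaluation code.

```python
def average_precision(preds, gts, delta):
    """Average precision for one action class at one matching tolerance.

    preds: list of (time_s, confidence); gts: list of ground-truth times.
    A prediction is a true positive if it falls within +/- delta seconds of a
    still-unmatched ground-truth event (greedy matching in confidence order).
    """
    preds = sorted(preds, key=lambda p: -p[1])
    matched = [False] * len(gts)
    hits = []
    for t, _ in preds:
        hit = False
        for i, g in enumerate(gts):
            if not matched[i] and abs(t - g) <= delta:
                matched[i] = hit = True
                break
        hits.append(hit)
    # Non-interpolated AP over the confidence-ranked predictions.
    ap, tp, prev_recall = 0.0, 0, 0.0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        recall = tp / len(gts) if gts else 0.0
        ap += (recall - prev_recall) * (tp / rank)
        prev_recall = recall
    return ap

# Toy example: one class with ground-truth events at 120 s and 1800 s.
gts = [120.0, 1800.0]
preds = [(118.0, 0.9), (1795.0, 0.8), (600.0, 0.4)]

# Loose ("average") mAP over 5-60 s tolerances; tight mAP over 1-5 s
# (our reading of the SoccerNet protocol; the ranges are assumptions).
loose_map = sum(average_precision(preds, gts, d) for d in range(5, 65, 5)) / 12
tight_map = sum(average_precision(preds, gts, d) for d in range(1, 6)) / 5
print(f"loose mAP: {loose_map:.3f}  tight mAP: {tight_map:.3f}")
```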

Results

  • Achieves an average mAP of 64.50%
  • Achieves a tight average mAP of 60.75%
  • Outperforms the SoccerNet baseline by 14.80% in average mAP
  • Surpasses RMS-Net by 11.10% in average mAP

Limitations

The authors identified the following limitations:

  • Dependent on the quality of commentary and transcription
  • Errors in transcription can affect results
  • Assumes commentary is sufficiently detailed for action spotting

Technical Requirements

  • Number of GPUs: 1
  • GPU Type: RTX 3060
  • Compute Requirements: None specified
