CONFIDENCE SCORING FOR LLM-GENERATED SQL IN SUPPLY CHAIN DATA EXTRACTION

(2025)

Paper Information

arXiv ID: 2506.17203

Abstract

Large Language Models (LLMs) have recently enabled natural language interfaces that translate user queries into executable SQL, offering a powerful solution for non-technical stakeholders to access structured data. However, a key limitation is that LLMs do not natively express uncertainty, which makes it difficult to assess the reliability of their generated queries. This paper presents a case study evaluating multiple approaches to estimating confidence scores for LLM-generated SQL in supply chain data retrieval. We investigated three strategies: (1) translation-based consistency checks; (2) embedding-based semantic similarity between user questions and generated SQL; and (3) self-reported confidence scores produced directly by the LLM. Our findings reveal that LLMs are often overconfident in their own outputs, which limits the effectiveness of self-reported confidence. In contrast, embedding-based similarity methods demonstrate strong discriminative power in identifying inaccurate SQL.

Summary

This paper investigates confidence scoring methods for SQL queries generated by Large Language Models (LLMs) in the context of supply chain data extraction. It identifies the trust issues associated with LLMs due to their inability to express uncertainty, particularly when translating natural language queries into SQL. The study evaluates three confidence scoring strategies: translation-based consistency checks, embedding-based semantic similarity, and self-reported confidence scores from the LLM. Experimental results show that LLMs tend to be overconfident in their outputs, especially for simpler questions. The embedding-based similarity method demonstrated the highest discriminative performance in identifying incorrect SQL queries, indicating its potential for improving user trust in LLM-generated outputs. Future work will focus on validating these methods across more diverse datasets and enhancing the robustness of the confidence scoring framework.

Methods

This paper employs the following methods:

  • Translation-based consistency checks
  • Embedding-based semantic similarity
  • Self-reported confidence
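The embedding-based semantic similarity strategy can be sketched as follows. The bag-of-words embedding below is a toy stand-in for illustration only; a real system would use a neural sentence encoder, and the example question and SQL are hypothetical, not taken from the paper.

```python
from collections import Counter
import math

def bow_embedding(text):
    # Toy bag-of-words embedding (stand-in for a neural sentence encoder).
    cleaned = text.lower().replace(",", " ").replace("(", " ").replace(")", " ")
    return Counter(cleaned.split())

def cosine_similarity(a, b):
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical supply chain question and LLM-generated SQL
question = "total shipped quantity per warehouse"
sql = "SELECT warehouse, SUM(quantity) FROM shipments GROUP BY warehouse"

score = cosine_similarity(bow_embedding(question), bow_embedding(sql))
```

A higher score indicates closer semantic alignment between the question and the generated query; the paper finds such scores discriminate well between correct and incorrect SQL.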

Models Used

  • Claude 3 Sonnet

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • AUROC
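AUROC can be computed directly from per-query confidence scores and correctness labels via the Mann-Whitney U statistic: the probability that a randomly chosen correct query receives a higher score than a randomly chosen incorrect one. A minimal sketch (scores and labels below are hypothetical):

```python
def auroc(scores, labels):
    # AUROC as the Mann-Whitney U statistic: fraction of (correct, incorrect)
    # pairs where the correct query's confidence score is higher (ties count 0.5).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical confidence scores and SQL-correctness labels
scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
result = auroc(scores, labels)  # 5 of 6 pairs ranked correctly
```

An AUROC of 0.5 means the confidence score is no better than chance at separating correct from incorrect SQL; 1.0 means perfect separation.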

Results

  • LLMs often overestimate their confidence in generated SQL queries
  • Embedding similarity methods outperform self-reported confidence in reliability
  • The similarity score serves as a reliable proxy for SQL correctness
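Used as a proxy for correctness, the similarity score can gate query execution, routing low-confidence queries to human review instead of running them. A minimal sketch; the threshold value is illustrative, not taken from the paper:

```python
def gate_sql(similarity_score, threshold=0.6):
    # Route low-similarity (likely incorrect) queries to human review
    # instead of executing them. The 0.6 threshold is an assumption;
    # in practice it would be tuned on a validation set.
    return "execute" if similarity_score >= threshold else "review"
```

For example, `gate_sql(0.9)` returns `"execute"` while `gate_sql(0.2)` returns `"review"`.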

Limitations

The authors identified the following limitations:

  • Experiments conducted on a synthetic dataset due to privacy concerns
  • Limited generalizability of findings to real-world complexities

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified
  • Compute Requirements: None specified

External Resources