CONFIDENCE SCORING FOR LLM-GENERATED SQL IN SUPPLY CHAIN DATA EXTRACTION

(2025)

Paper Information

arXiv ID: 2506.17203

Abstract

Large Language Models (LLMs) have recently enabled natural language interfaces that translate user queries into executable SQL, offering a powerful solution for non-technical stakeholders to access structured data. However, a key limitation is that LLMs do not natively express uncertainty, which makes it difficult to assess the reliability of their generated queries. This paper presents a case study evaluating multiple approaches to estimating confidence scores for LLM-generated SQL in supply chain data retrieval. We investigated three strategies: (1) translation-based consistency checks; (2) embedding-based semantic similarity between user questions and generated SQL; and (3) self-reported confidence scores produced directly by the LLM. Our findings reveal that LLMs are often overconfident in their own outputs, which limits the effectiveness of self-reported confidence. In contrast, embedding-based similarity methods demonstrate strong discriminative power in identifying inaccurate SQL.

Summary

This paper investigates confidence scoring methods for SQL queries generated by Large Language Models (LLMs) in the context of supply chain data extraction. It identifies the trust issues associated with LLMs due to their inability to express uncertainty, particularly when translating natural language queries into SQL. The study evaluates three confidence scoring strategies: translation-based consistency checks, embedding-based semantic similarity, and self-reported confidence scores from the LLM. Experimental results show that LLMs tend to be overconfident in their outputs, especially for simpler questions. The embedding-based similarity method demonstrated the highest discriminative performance in identifying incorrect SQL queries, indicating its potential for improving user trust in LLM-generated outputs. Future work will focus on validating these methods across more diverse datasets and enhancing the robustness of the confidence scoring framework.

Methods

This paper employs the following methods:

  • Translation-based consistency checks
  • Embedding-based semantic similarity
  • Self-reported confidence
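The embedding-based semantic similarity strategy can be sketched as follows. The bag-of-words embedding below is a toy stand-in for illustration only; a real system would use a neural sentence encoder, and the example question and SQL are hypothetical, not taken from the paper.

```python
from collections import Counter
import math

def bow_embedding(text):
    # Toy bag-of-words embedding (stand-in for a neural sentence encoder).
    cleaned = text.lower().replace(",", " ").replace("(", " ").replace(")", " ")
    return Counter(cleaned.split())

def cosine_similarity(a, b):
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical supply chain question and LLM-generated SQL
question = "total shipped quantity per warehouse"
sql = "SELECT warehouse, SUM(quantity) FROM shipments GROUP BY warehouse"

score = cosine_similarity(bow_embedding(question), bow_embedding(sql))
```

A higher score indicates closer semantic alignment between the question and the generated query; the paper finds such scores discriminate well between correct and incorrect SQL.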

Models Used

  • Claude 3 Sonnet

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • AUROC
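AUROC can be computed directly from per-query confidence scores and correctness labels via the Mann-Whitney U statistic: the probability that a randomly chosen correct query receives a higher score than a randomly chosen incorrect one. A minimal sketch (scores and labels below are hypothetical):

```python
def auroc(scores, labels):
    # AUROC as the Mann-Whitney U statistic: fraction of (correct, incorrect)
    # pairs where the correct query's confidence score is higher (ties count 0.5).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical confidence scores and SQL-correctness labels
scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
result = auroc(scores, labels)  # 5 of 6 pairs ranked correctly
```

An AUROC of 0.5 means the confidence score is no better than chance at separating correct from incorrect SQL; 1.0 means perfect separation.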

Results

  • LLMs often overestimate their confidence in generated SQL queries
  • Embedding similarity methods outperform self-reported confidence in reliability
  • The similarity score serves as a reliable proxy for SQL correctness
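Used as a proxy for correctness, the similarity score can gate query execution, routing low-confidence queries to human review instead of running them. A minimal sketch; the threshold value is illustrative, not taken from the paper:

```python
def gate_sql(similarity_score, threshold=0.6):
    # Route low-similarity (likely incorrect) queries to human review
    # instead of executing them. The 0.6 threshold is an assumption;
    # in practice it would be tuned on a validation set.
    return "execute" if similarity_score >= threshold else "review"
```

For example, `gate_sql(0.9)` returns `"execute"` while `gate_sql(0.2)` returns `"review"`.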

Limitations

The authors identified the following limitations:

  • Experiments conducted on a synthetic dataset due to privacy concerns
  • Limited generalizability of findings to real-world complexities

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified
  • Compute Requirements: None specified

External Resources