The increasing availability of large language models (LLMs) has raised concerns about their potential misuse in online learning. While tools for detecting LLM-generated text exist and are widely used by researchers and educators, their reliability varies. Few studies have compared the accuracy of detection methods, defined criteria for identifying LLM-generated content, or evaluated the effect of LLM misuse on learner performance. In this study, we define LLM-generated text within open responses as text produced by any LLM without paraphrasing or refinement, as evaluated by human coders. We then fine-tune GPT-4o to detect LLM-generated responses and assess the impact of LLM misuse on learning. We find that our fine-tuned model outperforms the existing AI detection tool GPTZero, achieving 80% accuracy and an F1 score of 0.78, compared to GPTZero's 70% accuracy and macro F1 score of 0.50. We also find that learners suspected of LLM misuse on the open-response question were more than twice as likely to answer the corresponding posttest multiple-choice question (MCQ) correctly, suggesting misuse across both question types and a bypassing of the learning process. We pave the way for future work by demonstrating a structured, code-based approach to improving the detection of LLM-generated responses, and we propose using auxiliary statistical indicators such as unusually high assessment scores on related tasks, readability scores, and response duration. In support of open science, we contribute data and code to support the fine-tuning of similar models for similar use cases.
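As an illustration of the fine-tuning step, below is a minimal sketch of how human-coded open responses might be formatted and submitted through the OpenAI fine-tuning API. The file names, system prompt, label strings, and model snapshot are illustrative assumptions, not the authors' exact setup or hyperparameters.

```python
# Sketch: preparing labeled open responses for an OpenAI fine-tuning job.
# The system prompt, label strings, and file names are assumptions for
# illustration; the paper's exact data format is not reproduced here.
import json
from openai import OpenAI

SYSTEM_PROMPT = "Classify whether the student response was generated by an LLM."

def to_chat_example(response_text: str, is_llm_generated: bool) -> dict:
    """Convert one human-coded response into the chat-format JSONL record
    expected by the OpenAI fine-tuning endpoint."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": response_text},
            {"role": "assistant", "content": "llm" if is_llm_generated else "human"},
        ]
    }

# labeled_responses: (text, bool) pairs produced by human coders (placeholder).
labeled_responses = [("Photosynthesis converts light into chemical energy.", False)]

with open("train.jsonl", "w") as f:
    for text, label in labeled_responses:
        f.write(json.dumps(to_chat_example(text, label)) + "\n")

client = OpenAI()  # reads OPENAI_API_KEY from the environment
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # a GPT-4o snapshot that supports fine-tuning
)
print(job.id)
```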
This study investigates the reliability of detecting LLM-generated short answers and the subsequent effects on learner performance in online educational settings. It defines LLM-generated text based on criteria established by human coders and introduces a fine-tuned GPT-4o model, comparing its detection accuracy against existing tools such as GPTZero. The fine-tuned model achieved 80% accuracy and an F1 score of 0.78, outperforming GPTZero, which recorded 70% accuracy and a macro F1 score of 0.50. The research also found that learners suspected of misusing LLMs performed significantly better on posttest multiple-choice questions, indicating that reliance on LLMs can circumvent genuine learning. The study contributes a structured annotation rubric and emphasizes the need for reliable detection methods in educational contexts, warning against the potential negative impact of AI misuse on learner engagement and integrity.
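For reference, the accuracy and macro F1 figures quoted above are standard classification metrics; a minimal sketch of how they are typically computed with scikit-learn follows. The label arrays are placeholders, not the study's data.

```python
# Sketch: computing accuracy and macro F1 for a binary detector with
# scikit-learn. The arrays below are placeholders, not the study's data.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0]  # 1 = LLM-generated (human-coded ground truth)
y_pred = [1, 0, 0, 1, 0]  # detector output

print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(f"macro F1: {f1_score(y_true, y_pred, average='macro'):.2f}")
```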
This paper employs the following methods (a baseline sketch for the classical methods follows the list):
- GPT-4o
- Logistic Regression
- Random Forest
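Below is a minimal sketch pairing the two classical baselines with TF-IDF features, a common setup for short-answer classification. The featurization, hyperparameters, and example texts are assumptions; the paper's exact baseline pipeline is not specified here.

```python
# Sketch: logistic regression and random forest baselines on TF-IDF
# features of open responses. Feature choice and texts are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Photosynthesis turns sunlight into chemical energy in plants.",
    "Certainly! Photosynthesis is the biological process by which plants "
    "convert light energy into chemical energy.",
]
labels = [0, 1]  # 0 = human-written, 1 = LLM-generated (placeholder labels)

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=200)):
    model = make_pipeline(TfidfVectorizer(), clf)  # vectorize, then classify
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["Plants use sunlight to make food."]))
```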
The paper reports the following key findings:
- The fine-tuned GPT-4o model outperforms GPTZero, achieving 80% accuracy and an F1 score of 0.78 versus GPTZero's 70% accuracy and macro F1 score of 0.50
- Learners suspected of LLM misuse were more likely to answer the corresponding posttest MCQ correctly, with an odds ratio of 2.37 (a sketch of the odds-ratio computation follows this list)
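An odds ratio like the reported 2.37 is typically obtained by exponentiating a logistic-regression coefficient; a minimal sketch follows, using placeholder arrays rather than the study's data.

```python
# Sketch: estimating the odds ratio for posttest MCQ correctness given
# suspected LLM misuse, via logistic regression. Placeholder data only;
# the printed value will not match the paper's 2.37.
import numpy as np
import statsmodels.api as sm

suspected = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # 1 = suspected LLM misuse
correct = np.array([1, 1, 0, 1, 1, 0, 0, 0])    # 1 = answered MCQ correctly

X = sm.add_constant(suspected)         # intercept + misuse indicator
fit = sm.Logit(correct, X).fit(disp=0)
odds_ratio = np.exp(fit.params[1])     # exponentiated coefficient on `suspected`
print(f"odds ratio: {odds_ratio:.2f}")
```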
- Compute requirements (number of GPUs, GPU type): none specified