VerilogEval

Dataset Information
  Modalities: Texts
  Languages: English
  Introduced: 2023

Overview

VerilogEval Dataset

The VerilogEval Dataset is a benchmark designed specifically to assess the ability of large language models (LLMs) to generate syntactically correct and functionally accurate Verilog code. Introduced in the 2023 paper "VerilogEval: Evaluating Large Language Models for Verilog Code Generation", it has become a widely used reference benchmark for research in hardware code generation.


Dataset Characteristics

  • Diverse Problem Set:
    The dataset comprises 156 distinct problems sourced from HDLBits, covering a wide range of digital design tasks—from simple combinational circuits to more complex sequential and state-based designs.

  • Dual Descriptions:
    It includes two types of problem statements:

      • VerilogEval-human: descriptions handcrafted by experts, closely reflecting real-world design challenges.
      • VerilogEval-machine: machine-generated descriptions, typically more verbose, intended to simulate automated problem formulation.

  • Automated Evaluation Framework:
    Each problem comes with a canonical solution and an evaluation harness that uses simulation (e.g., via Icarus Verilog) to verify the functional correctness of generated code; a minimal sketch of such a simulation check follows this list. Evaluation metrics such as pass@k account for the non-deterministic outputs of LLMs.

  • Rich Evaluation Metrics:
    The dataset supports detailed error analysis by classifying failures (e.g., syntax errors, simulation mismatches) and quantifying performance improvements through metrics like pass@1 and pass@5.
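
As a rough illustration of how such a harness can check functional correctness (a simplified sketch, not the official evaluation script), a generated module can be compiled together with the problem's testbench using Icarus Verilog and simulated. The file layout and the "Mismatches: 0" pass marker below are assumptions for illustration only:

```python
import os
import subprocess
import tempfile

def passes_functional_test(generated_module: str, testbench: str) -> bool:
    """Compile a candidate module together with the problem's testbench using
    Icarus Verilog, run the simulation, and decide pass/fail.

    Illustrative only: the actual VerilogEval harness, its file layout, and
    its pass criteria may differ.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "candidate.sv")
        sim = os.path.join(tmpdir, "sim.out")
        with open(src, "w") as f:
            f.write(generated_module + "\n" + testbench)

        # Compilation catches syntax errors in the generated code.
        compile_proc = subprocess.run(
            ["iverilog", "-g2012", "-o", sim, src],
            capture_output=True, text=True,
        )
        if compile_proc.returncode != 0:
            return False

        # Simulation catches functional mismatches against the reference.
        sim_proc = subprocess.run(["vvp", sim], capture_output=True, text=True)
        if sim_proc.returncode != 0:
            return False
        # Assumed pass marker: HDLBits-style testbenches report a mismatch
        # count; the exact string is an assumption for this sketch.
        return "Mismatches: 0 " in sim_proc.stdout
```

Splitting the check into a compile step and a simulate step also makes it easy to classify failures as syntax errors versus functional mismatches, which is how the error analysis described above is typically organized.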


Motivations and Content Summary

The primary motivations behind the VerilogEval Dataset are:

  • Standardized Benchmarking:
    To provide a reproducible framework for evaluating and comparing the performance of various LLMs on Verilog code generation tasks.

  • Advancing Hardware Design Automation:
    By focusing on Verilog—a key hardware description language—the dataset encourages research that bridges AI and digital hardware design, ultimately helping to automate and accelerate chip design processes.

  • Facilitating Model Improvement:
    The dataset’s detailed error classifications and pass rate metrics help pinpoint specific weaknesses in LLM-generated code, guiding future research in prompt engineering, in-context learning, and fine-tuning methods for improved performance.


Potential Use Cases

  • Benchmarking and Comparison:
    Researchers can use the dataset to measure and compare the performance of different LLMs (e.g., GPT-4, CodeGen, etc.) in generating correct Verilog code.

  • Prompt Engineering Research:
    The dataset allows exploration of the effects of prompt tuning and in-context learning on the quality of generated hardware description language code.

  • Fine-Tuning and Domain Adaptation:
    It serves as an excellent resource for supervised fine-tuning, enabling models to adapt better to the nuances of Verilog and hardware design tasks (a loading sketch follows this list).

  • Educational Resource:
    Educators and students can leverage the dataset to practice Verilog coding, test design understanding, and learn automated testing techniques in digital design courses.

  • EDA Tool Integration:
    The dataset can be integrated into Electronic Design Automation (EDA) workflows for automatic code verification, debugging, and performance analysis.
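
For concreteness, the sketch below shows one way to load the released problems and build (prompt, solution) pairs for benchmarking or fine-tuning. The file name and the field names (task_id, prompt, canonical_solution, test) mirror the HumanEval-style layout and are assumptions; check the official release for the exact schema.

```python
import json

def load_problems(path: str) -> list[dict]:
    """Load VerilogEval problems from a JSONL file (one JSON object per line).

    The field names used below (task_id, prompt, canonical_solution, test)
    are assumptions based on the HumanEval-style layout.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                problems.append(json.loads(line))
    return problems

# Build (prompt, reference-solution) pairs, e.g. for supervised fine-tuning
# or for constructing generation prompts when benchmarking a model.
problems = load_problems("VerilogEval_Human.jsonl")
pairs = [(p["prompt"], p["canonical_solution"]) for p in problems]
print(f"Loaded {len(pairs)} problems; first task: {problems[0]['task_id']}")
```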


Example Evaluation Metric

An example of a metric used in the evaluation is the pass rate, defined as:

\[
\text{Pass Rate} = \frac{\text{number of samples that pass the functional tests}}{n}
\]

where n is the number of samples generated for a given problem. Because LLM outputs are non-deterministic, averaging over multiple samples per problem gives a more reliable measure of how often the generated code is functionally correct.
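
In practice, results are usually reported as pass@k for small k, computed with the unbiased estimator popularized by HumanEval-style evaluations (commonly used for VerilogEval results as well). A minimal Python sketch, assuming n samples are generated per problem and c of them pass the testbench:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations, c of which are
    correct, passes the functional tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 generations for one problem, 7 of which pass the testbench.
print(pass_at_k(n=20, c=7, k=1))   # 0.35
print(pass_at_k(n=20, c=7, k=5))   # higher: any of 5 sampled attempts may pass
```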

Variants: VerilogEval

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

  Task:  Code Generation
  Model: Nexus (Claude 3.5 Sonnet)
  Paper: Nexus: A Lightweight and Scalable …
  Date:  2025-02-26
