VerilogEval

Dataset Information
  Modalities: Texts
  Languages: English
  Introduced: 2023

Overview

VerilogEval Dataset

The VerilogEval Dataset is a benchmark designed specifically to assess the ability of large language models (LLMs) to generate syntactically correct and functionally accurate Verilog code. Introduced in the 2023 paper "VerilogEval: Evaluating Large Language Models for Verilog Code Generation", it has become a widely used reference benchmark for research in hardware code generation.


Dataset Characteristics

  • Diverse Problem Set:
    The dataset comprises 156 distinct problems sourced from HDLBits, covering a wide range of digital design tasks—from simple combinational circuits to more complex sequential and state-based designs.

  • Dual Descriptions:
    It includes two types of problem statements:

      • VerilogEval-human: descriptions handcrafted by experts, closely reflecting real-world design challenges.
      • VerilogEval-machine: machine-generated descriptions, typically more verbose, intended to simulate automated problem formulation.

  • Automated Evaluation Framework:
    Each problem comes with a canonical solution and an evaluation harness that uses simulation (e.g., via Icarus Verilog) to verify the functional correctness of generated code; a minimal sketch of such a simulation check follows this list. Evaluation metrics such as pass@k account for the non-deterministic outputs of LLMs.

  • Rich Evaluation Metrics:
    The dataset supports detailed error analysis by classifying failures (e.g., syntax errors, simulation mismatches) and quantifying performance improvements through metrics like pass@1 and pass@5.
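
As a rough illustration of how such a harness can check functional correctness (a simplified sketch, not the official evaluation script), a generated module can be compiled together with the problem's testbench using Icarus Verilog and simulated. The file layout and the "Mismatches: 0" pass marker below are assumptions for illustration only:

```python
import os
import subprocess
import tempfile

def passes_functional_test(generated_module: str, testbench: str) -> bool:
    """Compile a candidate module together with the problem's testbench using
    Icarus Verilog, run the simulation, and decide pass/fail.

    Illustrative only: the actual VerilogEval harness, its file layout, and
    its pass criteria may differ.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "candidate.sv")
        sim = os.path.join(tmpdir, "sim.out")
        with open(src, "w") as f:
            f.write(generated_module + "\n" + testbench)

        # Compilation catches syntax errors in the generated code.
        compile_proc = subprocess.run(
            ["iverilog", "-g2012", "-o", sim, src],
            capture_output=True, text=True,
        )
        if compile_proc.returncode != 0:
            return False

        # Simulation catches functional mismatches against the reference.
        sim_proc = subprocess.run(["vvp", sim], capture_output=True, text=True)
        if sim_proc.returncode != 0:
            return False
        # Assumed pass marker: HDLBits-style testbenches report a mismatch
        # count; the exact string is an assumption for this sketch.
        return "Mismatches: 0 " in sim_proc.stdout
```

Splitting the check into a compile step and a simulate step also makes it easy to classify failures as syntax errors versus functional mismatches, which is how the error analysis described above is typically organized.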


Motivations and Content Summary

The primary motivations behind the VerilogEval Dataset are:

  • Standardized Benchmarking:
    To provide a reproducible framework for evaluating and comparing the performance of various LLMs on Verilog code generation tasks.

  • Advancing Hardware Design Automation:
    By focusing on Verilog—a key hardware description language—the dataset encourages research that bridges AI and digital hardware design, ultimately helping to automate and accelerate chip design processes.

  • Facilitating Model Improvement:
    The dataset’s detailed error classifications and pass rate metrics help pinpoint specific weaknesses in LLM-generated code, guiding future research in prompt engineering, in-context learning, and fine-tuning methods for improved performance.


Potential Use Cases

  • Benchmarking and Comparison:
    Researchers can use the dataset to measure and compare the performance of different LLMs (e.g., GPT-4, CodeGen, etc.) in generating correct Verilog code.

  • Prompt Engineering Research:
    The dataset allows exploration of the effects of prompt tuning and in-context learning on the quality of generated hardware description language code.

  • Fine-Tuning and Domain Adaptation:
    It serves as an excellent resource for supervised fine-tuning, enabling models to adapt better to the nuances of Verilog and hardware design tasks (a loading sketch follows this list).

  • Educational Resource:
    Educators and students can leverage the dataset to practice Verilog coding, test design understanding, and learn automated testing techniques in digital design courses.

  • EDA Tool Integration:
    The dataset can be integrated into Electronic Design Automation (EDA) workflows for automatic code verification, debugging, and performance analysis.
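
For concreteness, the sketch below shows one way to load the released problems and build (prompt, solution) pairs for benchmarking or fine-tuning. The file name and the field names (task_id, prompt, canonical_solution, test) mirror the HumanEval-style layout and are assumptions; check the official release for the exact schema.

```python
import json

def load_problems(path: str) -> list[dict]:
    """Load VerilogEval problems from a JSONL file (one JSON object per line).

    The field names used below (task_id, prompt, canonical_solution, test)
    are assumptions based on the HumanEval-style layout.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                problems.append(json.loads(line))
    return problems

# Build (prompt, reference-solution) pairs, e.g. for supervised fine-tuning
# or for constructing generation prompts when benchmarking a model.
problems = load_problems("VerilogEval_Human.jsonl")
pairs = [(p["prompt"], p["canonical_solution"]) for p in problems]
print(f"Loaded {len(pairs)} problems; first task: {problems[0]['task_id']}")
```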


Example Evaluation Metric

An example of a metric used in the evaluation is the pass rate, defined as:

\[
\text{Pass Rate} = \frac{\text{number of samples that pass the functional tests}}{n}
\]

where n is the number of samples generated for a given problem. Because LLM outputs are non-deterministic, averaging over multiple samples per problem gives a more reliable measure of how often the generated code is functionally correct.
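
In practice, results are usually reported as pass@k for small k, computed with the unbiased estimator popularized by HumanEval-style evaluations (commonly used for VerilogEval results as well). A minimal Python sketch, assuming n samples are generated per problem and c of them pass the testbench:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations, c of which are
    correct, passes the functional tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 generations for one problem, 7 of which pass the testbench.
print(pass_at_k(n=20, c=7, k=1))   # 0.35
print(pass_at_k(n=20, c=7, k=5))   # higher: any of 5 sampled attempts may pass
```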

Variants: VerilogEval

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

  Task:  Code Generation
  Model: Nexus (Claude 3.5 Sonnet)
  Paper: Nexus: A Lightweight and Scalable …
  Date:  2025-02-26
