Shellcode_IA32

Dataset Information
Languages: English
Introduced: 2021
License: Unknown
Homepage

Overview

Shellcode_IA32 is a dataset containing 20 years of shellcodes from a variety of sources. It is the largest collection of shellcodes in assembly available to date.

This dataset consists of 3,200 examples of instructions in assembly language for IA-32 (the 32-bit version of the x86 Intel Architecture) from publicly available security exploits. We collected assembly programs used to generate shellcode from exploit-db and from shell-storm. We enriched the dataset by adding examples of assembly programs for the IA-32 architecture from popular tutorials and books. This allowed us to understand how different authors and assembly experts comment their code and, thus, how to deal with the ambiguity of natural language in this specific context. Of the instructions in the dataset, 10% were collected from books and guidelines, and the rest from real shellcodes.
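As a quick sanity check on the composition described above, the split can be worked out from the two stated figures. This is only a sketch that takes the 3,200 total and the 10% share at face value; the exact per-source counts are not given here, and the variable names are illustrative.

```python
# Figures taken from the dataset description; exact per-source
# counts are an assumption derived from the stated 10% share.
TOTAL_EXAMPLES = 3200
BOOK_SHARE_PERCENT = 10

# Integer arithmetic avoids any floating-point rounding.
book_examples = TOTAL_EXAMPLES * BOOK_SHARE_PERCENT // 100
shellcode_examples = TOTAL_EXAMPLES - book_examples

print(f"books and guidelines: {book_examples}")
print(f"real shellcodes:      {shellcode_examples}")
```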

Our focus is on Linux, the most common OS for security-critical network services. Accordingly, we added assembly instructions written for the Netwide Assembler (NASM) on Linux.

Each line of the Shellcode_IA32 dataset represents a snippet-intent pair. The snippet is a single line or a combination of multiple lines of assembly code written in NASM syntax. The intent is an English-language comment describing the snippet.
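The snippet-intent structure can be modeled roughly as follows. This is a minimal sketch, not the dataset's official loader: the class name, the helper functions, the `LINE_SEP` separator used to store a multi-line snippet on one line, and the two example pairs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical separator (a literal backslash followed by "n") marking
# line breaks inside a multi-line snippet stored on a single line.
# The dataset's real on-disk encoding may differ.
LINE_SEP = "\\n"


@dataclass(frozen=True)
class SnippetIntentPair:
    intent: str                     # English comment describing the code
    snippet_lines: Tuple[str, ...]  # one or more NASM instructions


def make_pair(intent: str, snippet: str) -> SnippetIntentPair:
    """Split a raw snippet into its instructions and pair it with its intent."""
    lines = tuple(p.strip() for p in snippet.split(LINE_SEP) if p.strip())
    if not lines:
        raise ValueError("snippet must contain at least one instruction")
    return SnippetIntentPair(intent=intent.strip(), snippet_lines=lines)


def load_pairs(intents: Iterable[str], snippets: Iterable[str]) -> List[SnippetIntentPair]:
    """Zip parallel sequences of intents and snippets into pairs."""
    intents, snippets = list(intents), list(snippets)
    if len(intents) != len(snippets):
        raise ValueError("intents and snippets must have the same length")
    return [make_pair(i, s) for i, s in zip(intents, snippets)]


# Illustrative pairs written for this sketch, not copied from the dataset.
pairs = load_pairs(
    ["push the contents of eax onto the stack",
     "zero out eax and push it onto the stack"],
    ["push eax",
     "xor eax, eax \\n push eax"],
)
for pair in pairs:
    print(pair.intent, "->", list(pair.snippet_lines))
```

Keeping the snippet as a tuple of instructions makes the single-line and multi-line cases uniform, which is convenient when feeding the pairs to a translation model.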

Further statistics on the dataset and a set of preliminary experiments performed with a neural machine translation (NMT) model are described in the following paper: Shellcode_IA32: A Dataset for Automatic Shellcode Generation.

Variants: Shellcode_IA32

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task | Model | Paper | Date
Code Generation | CodeBERT | Can We Generate Shellcodes via … | 2022-02-08
Code Generation | Seq2Seq with Attention | Can We Generate Shellcodes via … | 2022-02-08
Code Generation | LSTM-based Sequence to Sequence | Shellcode_IA32: A Dataset for Automatic … | 2021-04-27
