MusicBrainz20K

Dataset Information
Languages
English
Introduced
2017

Overview

The MusicBrainz20K dataset for entity resolution and entity clustering is based on real song records from the MusicBrainz database. Each record has five attributes: artist, title, album, year, and length. The records have been modified with the DAPO [1] data generator. The generated dataset spans five sources and contains approximately 20K records describing 10K unique song entities. It contains duplicates for 50% of the original records, appearing in two to five sources; the duplicates are generated with a high degree of corruption to stress-test entity resolution and clustering approaches.

[1] Hildebrandt, Kai, et al. "Large-scale data pollution with Apache Spark." IEEE Transactions on Big Data 6.2 (2017): 396-411.
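
To make the task concrete, the following is a minimal sketch of how records with the five attributes above might be compared and grouped into entity clusters. It is an illustration under stated assumptions, not the method used to build or evaluate the dataset: the `SongRecord` type, the `record_id` and `source_id` fields, the 0.8 threshold, and the toy records are all hypothetical, and the similarity measure is plain character-level matching from Python's standard library.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class SongRecord:
    # record_id and source_id are hypothetical bookkeeping fields;
    # the remaining five attributes are those named in the overview.
    record_id: int
    source_id: int
    artist: str
    title: str
    album: str
    year: str
    length: str


def similarity(a: SongRecord, b: SongRecord) -> float:
    """Mean character-level similarity over artist, title and album."""
    fields = ("artist", "title", "album")
    scores = [
        SequenceMatcher(None, getattr(a, f).lower(), getattr(b, f).lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def cluster(records, threshold=0.8):
    """Group records linked by a pairwise similarity >= threshold (union-find)."""
    parent = {r.record_id: r.record_id for r in records}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Naive all-pairs comparison: quadratic, fine for a toy example only.
    for a, b in combinations(records, 2):
        if similarity(a, b) >= threshold:
            parent[find(a.record_id)] = find(b.record_id)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r.record_id), []).append(r.record_id)
    return sorted(sorted(ids) for ids in clusters.values())


# Invented toy records (not taken from the dataset): records 1 and 2
# describe the same song, with record 2 corrupted by typos.
records = [
    SongRecord(1, 1, "The Example Band", "Blue Morning", "First Light", "2001", "3:45"),
    SongRecord(2, 2, "The Exmple Band", "Blue Mornin", "First Light", "01", "225"),
    SongRecord(3, 3, "Another Artist", "Red Evening", "Second Dusk", "1999", "4:10"),
]

print(cluster(records))
```

Comparing every pair is only workable on a handful of records; at roughly 20K records a real pipeline would typically add a blocking step before pairwise comparison, and would likely also normalize the year and length fields, which this sketch ignores.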

Variants: MusicBrainz20K

Associated Benchmarks

This dataset is used in 1 benchmark:
