MusicBrainz20K

Name: MusicBrainz20K
Published: 2017-09-01
License: Creative Commons license

Dataset Information

Languages

English

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

The MusicBrainz20K dataset for entity resolution and entity clustering is based on real song records from the MusicBrainz database. Each record has five attributes: artist, title, album, year, and length. The records were modified with the DAPO [1] data generator. The generated dataset consists of five sources and approximately 20K records describing 10K unique song entities. For 50% of the original records it contains duplicates spread across two to five sources; these duplicates are generated with a high degree of corruption to stress-test entity resolution and clustering approaches.

[1] Hildebrandt, Kai, et al. "Large-scale data pollution with Apache Spark." IEEE Transactions on Big Data 6.2 (2017): 396-411.
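
To make the record layout concrete, the following is a minimal sketch of how one might represent a record and flag candidate duplicate pairs. The five attributes come from the description above; the `record_id` and `source_id` fields, the class and function names, the choice of string similarity, and the 0.8 threshold are all illustrative assumptions, not part of the dataset or of any published method.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class SongRecord:
    record_id: int   # hypothetical identifier, assumed for illustration
    source_id: int   # hypothetical: which of the five sources the record is from
    artist: str
    title: str
    album: str
    year: str        # kept as text because corrupted values may not parse
    length: str      # kept as text for the same reason


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a: SongRecord, b: SongRecord) -> float:
    """Average character-level similarity over artist, title and album."""
    fields = ("artist", "title", "album")
    scores = [
        SequenceMatcher(None, normalize(getattr(a, f)), normalize(getattr(b, f))).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def candidate_matches(records, threshold=0.8):
    """Return id pairs whose similarity reaches the (arbitrary) threshold.

    This compares all pairs, which is quadratic; a dataset of about 20K
    records would normally need a blocking step first.
    """
    return [
        (a.record_id, b.record_id)
        for a, b in combinations(records, 2)
        if similarity(a, b) >= threshold
    ]
```

Heavily corrupted duplicates are exactly the cases where a fixed threshold on plain string similarity tends to fail, which is why the sketch should be read as a baseline illustration only.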

Variants: MusicBrainz20K

Associated Benchmarks

This dataset is used in 1 benchmark:

Entity Resolution - Metrics: F1
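
The listing names F1 as the metric but does not say how it is computed. One common convention in entity resolution is pairwise F1 over matched record pairs; the sketch below assumes that convention, and the function name `pairwise_f1` is hypothetical.

```python
def pairwise_f1(predicted_pairs, true_pairs):
    """Pairwise F1: harmonic mean of precision and recall over record pairs.

    Pairs are treated as unordered, so (1, 2) and (2, 1) count as the same match.
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}

    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, predicting the pairs (1, 2) and (3, 4) when the true matches are (2, 1) and (5, 6) gives one true positive, so precision and recall are both 0.5 and the F1 score is 0.5.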

Recent Benchmark Submissions

No recent benchmark submissions available for this dataset.

Research Papers

No papers with results on this dataset found.
