GUM

Georgetown University Multilayer corpus

Dataset Information

Modalities

Texts, Speech

Languages

English

License

CC-BY-NC-SA

Homepage

Official Website

Overview

GUM is an open-source, multilayer English corpus of richly annotated texts drawn from twelve text types. Its annotation layers include:

Multiple POS tags, morphological features and lemmatization
Sentence segmentation and rough speech act labels
Document structure in TEI XML (paragraphs, headings, figures, etc.)
ISO date/time annotations
Speaker and addressee information (where relevant)
Constituent and dependency syntax
Information status (given, accessible, new, split antecedent)
Entity and coreference annotation, including bridging anaphora
Entity linking (Wikification)
Discourse parses in Rhetorical Structure Theory and discourse dependencies
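
To make the token-level layers above more concrete, here is a minimal sketch of reading POS tags, lemmas, morphological features and dependency relations from a CoNLL-U file. It assumes the dependency layer is available in the 10-column CoNLL-U format used by Universal Dependencies; this page does not specify file formats, so check the corpus release. The sample sentence, the `Token` type and the `parse_conllu` helper are illustrative and not part of GUM.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    """One syntactic word from a CoNLL-U sentence (hypothetical helper type)."""
    idx: int
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: Dict[str, str]
    head: int
    deprel: str


# Invented sample sentence for illustration; NOT taken from the corpus.
SAMPLE = """\
# text = She reads books.
1\tShe\tshe\tPRON\tPRP\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_
2\treads\tread\tVERB\tVBZ\tNumber=Sing|Person=3\t0\troot\t_\t_
3\tbooks\tbook\tNOUN\tNNS\tNumber=Plur\t2\tobj\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_
"""


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of Token tuples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as '# text = ...'
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        current.append(
            Token(
                idx=int(cols[0]),
                form=cols[1],
                lemma=cols[2],
                upos=cols[3],
                xpos=cols[4],
                feats=parse_feats(cols[5]),
                head=int(cols[6]),
                deprel=cols[7],
            )
        )
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
for tok in sentences[0]:
    print(tok.idx, tok.form, tok.upos, tok.head, tok.deprel)
```

The other layers (TEI XML structure, coreference, discourse parses) use their own formats and are not covered by this sketch.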

Variants: GUM

Associated Benchmarks

This dataset is used in 1 benchmark:

Entity Linking - Metrics: F1
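
The page names F1 as the metric but does not state the matching criteria. As a rough sketch, the following assumes micro-averaged F1 over exact matches of (start, end, identifier) triples; the actual benchmark may score differently. The function name and the example mentions are hypothetical.

```python
from typing import Iterable, Tuple

# (start token, end token, linked identifier)
Link = Tuple[int, int, str]


def entity_linking_f1(gold: Iterable[Link], predicted: Iterable[Link]):
    """Return (precision, recall, F1) over exactly matching links.

    Assumption: a prediction counts as correct only if the span
    boundaries and the identifier both match a gold link.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example data, for illustration only.
gold = [
    (0, 2, "Georgetown_University"),
    (5, 6, "Washington,_D.C."),
    (9, 10, "Potomac_River"),
]
predicted = [
    (0, 2, "Georgetown_University"),
    (5, 6, "Washington_(state)"),  # right span, wrong identifier
    (9, 10, "Potomac_River"),
    (12, 13, "Virginia"),          # spurious link
]

precision, recall, f1 = entity_linking_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

With two of four predictions correct and two of three gold links recovered, precision is 0.5, recall is 2/3, and F1 is 4/7.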

Recent Benchmark Submissions

Task	Model	Paper	Date
Entity Linking	baseline	WikiGUM: Exhaustive Entity Linking for …	2021-09-15

Research Papers

Recent papers with results on this dataset:

WikiGUM: Exhaustive Entity Linking for Wikification in 12 Genres (2021)
