GUM

Georgetown University Multilayer corpus

Dataset Information
Modalities: Texts, Speech
Languages: English

Overview

GUM is an open-source, multilayer English corpus of richly annotated texts from twelve text types. Annotations include:

  • Multiple POS tags, morphological features and lemmatization
  • Sentence segmentation and rough speech act labels
  • Document structure in TEI XML (paragraphs, headings, figures, etc.)
  • ISO date/time annotations
  • Speaker and addressee information (where relevant)
  • Constituent and dependency syntax
  • Information status (given, accessible, new, split antecedent)
  • Entity and coreference annotation, including bridging anaphora
  • Entity linking (Wikification)
  • Discourse parses in Rhetorical Structure Theory and discourse dependencies
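
As a rough illustration of how the token-level layers above (lemmas, POS tags, morphological features, dependency syntax) might be consumed, the sketch below reads text in the ten-column, tab-separated CoNLL-U format. That the dependency layer is available in CoNLL-U is an assumption made here, not something this page states; the `Token` class, the `read_conllu` helper and the toy `SAMPLE` sentence are all hypothetical and not taken from the corpus.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Token:
    id: int       # 1-based position in the sentence
    form: str     # surface form
    lemma: str
    upos: str     # universal POS tag
    xpos: str     # language-specific POS tag
    feats: str    # morphological features, "_" if none
    head: int     # id of the syntactic head, 0 for the root
    deprel: str   # dependency relation to the head


def read_conllu(text: str) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-U text."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata line
        cols = line.split("\t")
        # Skip malformed lines, multiword ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        head = int(cols[6]) if cols[6].isdigit() else -1
        sentence.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3],
                  cols[4], cols[5], head, cols[7])
        )
    if sentence:
        yield sentence


# Toy input written for this example only (not GUM data).
SAMPLE = (
    "# text = Dogs bark .\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

for sent in read_conllu(SAMPLE):
    for tok in sent:
        print(tok.id, tok.form, tok.lemma, tok.upos, tok.head, tok.deprel)
```

The other layers in the list (TEI document structure, entities and coreference, RST parses) would need their own readers; this sketch covers only the per-token columns.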

Variants: GUM

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task | Model | Paper | Date
Entity Linking | baseline | WikiGUM: Exhaustive Entity Linking for … | 2021-09-15

Research Papers

Recent papers with results on this dataset: