Center of Competence for Sustainable Finance

Climatext*

Climatext is a dataset for sentence-based climate change topic detection. It is intended for exploring different approaches to identifying the climate change topic in various text sources and is based on the paper "ClimaText: A Dataset for Climate Change Topic Detection" by Francesco S. Varini, Jordan Boyd-Graber, Massimiliano Ciaramita, and Markus Leippold.

*Accepted for the Tackling Climate Change with Machine Learning workshop at NeurIPS 2020.
 

Download Dataset
Access paper on arXiv.org
Direct download PDF (PDF, 289 KB)

Please cite the following paper for the use of the dataset:
Francesco S. Varini, Jordan Boyd-Graber, Massimiliano Ciaramita, and Markus Leippold (2020). ClimaText: A Dataset for Climate Change Topic Detection. In: Tackling Climate Change with Machine Learning workshop at NeurIPS 2020, Online, 11 December 2020.

DESCRIPTION

The dataset consists of several tab-separated values (TSV) files. Each TSV file contains at least four columns: "id", "label", "title", and "sentence". Some files also include an optional "paragraph" column.

The "label" is one of -1 (unlabeled), 0 (negative), or 1 (positive), where positive means that the sentence talks about climate change and negative means that it does not.
The "title" can be either the title of a document or the link to a webpage from which the sentence was taken.

The "paragraph" is either -1 (unspecified) or a positive integer giving the index of the paragraph from which the sentence was taken, with paragraphs numbered in ascending order through the text.
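As a rough illustration of the column layout described above, the following Python sketch parses a Climatext-style TSV stream and checks the label values. It assumes that each file starts with a header row naming the columns, which is not stated explicitly here; the function name read_climatext_tsv and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

REQUIRED_COLUMNS = {"id", "label", "title", "sentence"}
VALID_LABELS = {-1, 0, 1}  # unlabeled, negative, positive


def read_climatext_tsv(handle):
    """Parse a Climatext-style TSV stream into a list of dicts.

    Assumption: the first line is a header row with the column names.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        label = int(row["label"])
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label {label} for id {row['id']}")
        row["label"] = label
        # "paragraph" is optional; an absent column is treated as -1 (unspecified).
        row["paragraph"] = int(row.get("paragraph") or -1)
        rows.append(row)
    return rows


# Invented sample rows, used only to exercise the parser.
SAMPLE = (
    "id\tlabel\ttitle\tsentence\tparagraph\n"
    "1\t1\tExample title\tGlobal warming raises sea levels.\t2\n"
    "2\t0\tExample title\tThe museum opens at nine.\t-1\n"
    "3\t-1\tExample title\tAn unlabeled sentence.\t-1\n"
)

rows = read_climatext_tsv(io.StringIO(SAMPLE))
positives = [r for r in rows if r["label"] == 1]
labeled = [r for r in rows if r["label"] != -1]
```

For a real file, the same function could be called on a handle opened with open(path, encoding="utf-8", newline=""); the encoding is likewise an assumption.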

Document-labeled sentences
train-data\Wiki-Doc-Train.tsv, dev-data\Wiki-Doc-Dev.tsv, and test-data\Wiki-Doc-Test.tsv contain document-labeled sentences.

Unlabeled sentences
train-data\10-Ks (2014) unlabeled.tsv, test-data\10-Ks (2018) unlabeled.tsv

Human-labeled sentences
train-data\AL-10Ks.tsv, train-data\AL-Wiki.tsv, dev-data\Wikipedia (dev).tsv, test-data\Claims (test).tsv, test-data\Wiki-Doc-Test.tsv, test-data\Wikipedia (test).tsv
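The file names above are written with backslash separators. A minimal sketch of turning such a listed name into a path that also works on non-Windows systems is shown below; it assumes the downloaded archive unpacks into the train-data, dev-data, and test-data folders under a single root directory, and the helper name local_path and the root folder name "climatext" are hypothetical.

```python
from pathlib import Path, PureWindowsPath


def local_path(dataset_root, listed_name):
    """Convert a backslash-separated name from the list into a local path."""
    # PureWindowsPath splits on backslashes regardless of the host system.
    parts = PureWindowsPath(listed_name).parts
    return Path(dataset_root).joinpath(*parts)


al_10ks = local_path("climatext", r"train-data\AL-10Ks.tsv")
unlabeled_2018 = local_path("climatext", r"test-data\10-Ks (2018) unlabeled.tsv")
```

Spaces and parentheses in names such as "10-Ks (2018) unlabeled.tsv" are preserved as-is.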

Human-labeled sentences, other than the Claims, come from the larger document-labeled or unlabeled data sets and can be mapped back to them through the "id".
For more information on the data, please read the paper "ClimaText: A Dataset for Climate Change Topic Detection".
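The mapping through the "id" column can be sketched as a simple dictionary lookup. The example below is only an illustration under stated assumptions: rows are represented as dicts keyed by column name, ids are compared as strings, and the helper names index_by_id and attach_source as well as the sample rows are hypothetical.

```python
def index_by_id(rows):
    """Build a lookup table from "id" to row."""
    return {row["id"]: row for row in rows}


def attach_source(human_rows, source_rows):
    """Pair each human-labeled row with the row of the same id in the larger set.

    Returns (matched, unmatched); unmatched holds rows whose id is not found.
    """
    source = index_by_id(source_rows)
    matched, unmatched = [], []
    for row in human_rows:
        origin = source.get(row["id"])
        if origin is None:
            unmatched.append(row)
        else:
            # Keep the human label and record the label from the larger set.
            matched.append({**row, "source_label": origin["label"]})
    return matched, unmatched


# Invented rows: a document-labeled set and a smaller human-labeled set.
doc_labeled = [
    {"id": "10", "label": 1, "title": "Example A", "sentence": "First sentence."},
    {"id": "11", "label": 1, "title": "Example A", "sentence": "Second sentence."},
]
human_labeled = [
    {"id": "11", "label": 0, "title": "Example A", "sentence": "Second sentence."},
    {"id": "99", "label": 1, "title": "Example B", "sentence": "Not in the larger set."},
]

matched, unmatched = attach_source(human_labeled, doc_labeled)
```

Comparing "label" with "source_label" in the matched rows is one way to see where a human label differs from the label in the larger set.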