Climatext

Climatext is a dataset for sentence-based climate change topic detection. It was built to explore different approaches to identifying the climate change topic in various text sources, and it is based on the paper "ClimaText: A Dataset for Climate Change Topic Detection" by Francesco S. Varini, Jordan Boyd-Graber, Massimiliano Ciaramita, and Markus Leippold.

The paper was accepted for the Tackling Climate Change with Machine Learning workshop at NeurIPS 2020.

Download Dataset
Access paper on arXiv.org
Direct download PDF (PDF, 289 KB)

Please cite the following paper when using the dataset:
Francesco S. Varini, Jordan Boyd-Graber, Massimiliano Ciaramita, and Markus Leippold (2020). ClimaText: A Dataset for Climate Change Topic Detection. In: Tackling Climate Change with Machine Learning workshop at NeurIPS 2020, Online, 11 December 2020.

DESCRIPTION

The dataset consists of several tab-separated values (TSV) files. Each TSV file contains at least four columns: "id", "label", "title", and "sentence". Some files also contain a "paragraph" column.

The "label" is -1 (unlabeled), 0 (negative), or 1 (positive). Positive means that the sentence talks about climate change; negative means that it does not.
The "title" is either the title of the document or the link to the webpage from which the sentence was taken.

The "paragraph" is either -1 (unspecified) or a positive integer giving the index, in ascending order, of the paragraph in the source text from which the sentence was taken.
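The column layout described above can be checked while loading a file. The following is a minimal sketch using only the Python standard library. It assumes that each file starts with a header row naming the columns and that fields are not quoted; neither detail is stated on this page. The function name read_climatext_tsv and the sample rows are hypothetical and serve only as illustration.

```python
import csv
import io

REQUIRED_COLUMNS = {"id", "label", "title", "sentence"}
LABEL_NAMES = {-1: "unlabeled", 0: "negative", 1: "positive"}


def read_climatext_tsv(handle):
    """Parse one TSV file into a list of dicts with integer label/paragraph."""
    # Assumption: the first line is a header row naming the columns.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        label = int(row["label"])
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label {label!r}")
        row["label"] = label
        # "paragraph" is optional; an absent column is treated as unspecified (-1).
        row["paragraph"] = int(row.get("paragraph") or -1)
        rows.append(row)
    return rows


# Made-up sample rows for illustration only; not taken from the dataset.
sample = io.StringIO(
    "id\tlabel\ttitle\tsentence\tparagraph\n"
    "0\t1\tExample title\tGlobal warming raises sea levels.\t3\n"
    "1\t0\tExample title\tThe company opened a new office.\t-1\n"
)
rows = read_climatext_tsv(sample)
print([(r["id"], LABEL_NAMES[r["label"]], r["paragraph"]) for r in rows])
```

For a real file, the same function can be given a handle opened with open(path, newline="", encoding="utf-8"); the UTF-8 encoding is likewise an assumption.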

TRAIN DATA

train-data\10-Ks (2014) unlabeled.tsv : 568504 (unlabeled) (TSV, 124 MB)
train-data\Wiki-Doc-Train.tsv : 115847 (57922 positives, 57925 negatives) (TSV, 18 MB)
train-data\AL-10Ks.tsv : 3000 (58 positives, 2942 negatives) (TSV, 521 KB)
train-data\AL-Wiki.tsv : 3000 (261 positives, 2739 negatives) (TSV, 440 KB)

DEVELOPMENT DATA

dev-data\Wiki-Doc-Dev.tsv : 3618 (1809 positives, 1809 negatives) (TSV, 588 KB)
dev-data\Wikipedia (dev).tsv : 300 (82 positives, 218 negatives) (TSV, 52 KB)

TEST DATA

test-data\10-Ks (2018) unlabeled.tsv : 1266245 (unlabeled)
test-data\10-Ks (2018, test).tsv : 300 (67 positives, 233 negatives) (TSV, 71 KB)
test-data\Claims (test).tsv : 1000 (500 positives, 500 negatives) (TSV, 225 KB)
test-data\Wiki-Doc-Test.tsv : 3826 (1913 positives, 1913 negatives) (TSV, 613 KB)
test-data\Wikipedia (test).tsv : 300 (33 positives, 267 negatives) (TSV, 50 KB)
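
The counts listed above can serve as a sanity check after download. The sketch below copies the listed totals for the labeled files, verifies that positives and negatives add up to each total, and prints the share of positive sentences. The helper label_counts is a hypothetical name; it is meant to be applied to the "label" column of a loaded file so that the result can be compared with the listed numbers.

```python
from collections import Counter

# (total, positives, negatives) exactly as listed above for the labeled files.
LISTED = {
    r"train-data\Wiki-Doc-Train.tsv": (115847, 57922, 57925),
    r"train-data\AL-10Ks.tsv": (3000, 58, 2942),
    r"train-data\AL-Wiki.tsv": (3000, 261, 2739),
    r"dev-data\Wiki-Doc-Dev.tsv": (3618, 1809, 1809),
    r"dev-data\Wikipedia (dev).tsv": (300, 82, 218),
    r"test-data\10-Ks (2018, test).tsv": (300, 67, 233),
    r"test-data\Claims (test).tsv": (1000, 500, 500),
    r"test-data\Wiki-Doc-Test.tsv": (3826, 1913, 1913),
    r"test-data\Wikipedia (test).tsv": (300, 33, 267),
}


def label_counts(labels):
    """Return (total, positives, negatives) for an iterable of integer labels."""
    counts = Counter(labels)
    return sum(counts.values()), counts[1], counts[0]


for name, (total, positives, negatives) in LISTED.items():
    # Labeled files contain no -1 labels, so the two classes must sum to the total.
    assert positives + negatives == total, name
    print(f"{name}: {positives / total:.1%} positive")
```

The document-labeled Wiki-Doc files and the Claims test set are balanced, while the remaining human-labeled files contain far fewer positives than negatives, which is worth keeping in mind when choosing evaluation metrics.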

Document-labeled sentences
train-data\Wiki-Doc-Train.tsv, dev-data\Wiki-Doc-Dev.tsv, test-data\Wiki-Doc-Test.tsv

Unlabeled sentences
train-data\10-Ks (2014) unlabeled.tsv, test-data\10-Ks (2018) unlabeled.tsv

Human-labeled sentences
train-data\AL-10Ks.tsv, train-data\AL-Wiki.tsv, dev-data\Wikipedia (dev).tsv, test-data\Claims (test).tsv, test-data\Wiki-Doc-Test.tsv, test-data\Wikipedia (test).tsv

Human-labeled sentences, other than the Claims, come from the larger document-labeled or unlabeled datasets and can be mapped back to them through the "id" column.
For more information on the data, please read the paper "ClimaText: A Dataset for Climate Change Topic Detection".
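
The mapping through "id" mentioned above can be sketched as a dictionary lookup. This example assumes that "id" values are unique within a source file and are written identically in the human-labeled file and its source; the page does not confirm either point. The function name map_to_source and the sample rows are hypothetical.

```python
def map_to_source(human_rows, source_rows):
    """Pair each human-labeled row with the source row that shares its "id"."""
    # Assumption: ids are unique within the source file.
    source_by_id = {row["id"]: row for row in source_rows}
    matched, unmatched = [], []
    for row in human_rows:
        source = source_by_id.get(row["id"])
        if source is None:
            unmatched.append(row)
        else:
            matched.append((row, source))
    return matched, unmatched


# Made-up rows for illustration only; not taken from the dataset.
source_rows = [
    {"id": "10", "label": -1, "sentence": "Sentence A."},
    {"id": "11", "label": -1, "sentence": "Sentence B."},
]
human_rows = [
    {"id": "11", "label": 1, "sentence": "Sentence B."},
    {"id": "99", "label": 0, "sentence": "Sentence without a source row."},
]
matched, unmatched = map_to_source(human_rows, source_rows)
print(len(matched), "matched;", len(unmatched), "unmatched")
```

Rows that end up in the unmatched list would indicate either a sentence from the Claims file, which has no larger source set, or a mismatch in how the ids were read.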
