ClimaText is a dataset for sentence-based climate change topic detection. It supports exploring different approaches to identifying the climate change topic in various text sources and is based on the paper "ClimaText: A Dataset for Climate Change Topic Detection" by Francesco S. Varini, Jordan Boyd-Graber, Massimiliano Ciaramita, and Markus Leippold.
Accepted for the Tackling Climate Change with Machine Learning workshop at NeurIPS 2020.
Please cite the following paper when using the dataset:
Francesco S. Varini, Jordan Boyd-Graber, Massimiliano Ciaramita, and Markus Leippold (2020). ClimaText: A Dataset for Climate Change Topic Detection. In: Tackling Climate Change with Machine Learning workshop at NeurIPS 2020, Online, 11 December 2020.
The dataset is composed of several tab-separated values (TSV) files. Each TSV file contains at least four columns: "id", "label", "title", and "sentence". Some files also contain a "paragraph" column.
The "label" is one of -1 (unlabeled), 0 (negative), or 1 (positive), where positive means that the sentence talks about climate change and negative means that it does not.
The "title" can be either the title of a document or the link to a webpage from which the sentence was taken.
The "paragraph" is either -1 (unspecified) or a positive integer giving the index of the paragraph from which the sentence was taken, with paragraphs numbered in ascending order through the text.
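The column layout described above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader: the function name read_rows is hypothetical, the sample rows are made up for illustration, and it assumes each file starts with a header row and uses default quoting.

```python
import csv
import io

REQUIRED_COLUMNS = ("id", "label", "title", "sentence")
LABEL_NAMES = {-1: "unlabeled", 0: "negative", 1: "positive"}


def read_rows(handle):
    """Parse a tab-separated file object into a list of dicts with typed fields.

    Hypothetical helper: assumes a header row naming the columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for raw in reader:
        label = int(raw["label"])
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label: {label}")
        rows.append({
            "id": raw["id"],
            "label": label,
            "title": raw["title"],
            "sentence": raw["sentence"],
            # "paragraph" is optional; fall back to -1 (unspecified) if absent.
            "paragraph": int(raw.get("paragraph") or -1),
        })
    return rows


# Made-up sample rows, for illustration only (not taken from the dataset).
SAMPLE = (
    "id\tlabel\ttitle\tsentence\tparagraph\n"
    "a1\t1\tExample title\tRising sea levels threaten coastal cities.\t3\n"
    "a2\t0\tExample title\tThe company opened a new office.\t-1\n"
    "a3\t-1\thttps://example.org/page\tRevenue grew last quarter.\t7\n"
)

rows = read_rows(io.StringIO(SAMPLE))
positives = [r for r in rows if r["label"] == 1]
print(len(rows), "rows,", len(positives), "positive")
```

To read a real file, an open file object can be passed instead of the in-memory sample, for example open(path, newline="", encoding="utf-8"); the encoding is an assumption.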
train-data\10-Ks (2014) unlabeled.tsv : 568504 (unlabeled) (TSV, 124 MB)
train-data\Wiki-Doc-Train.tsv : 115847 (57922 positives, 57925 negatives) (TSV, 18 MB)
train-data\AL-10Ks.tsv : 3000 (58 positives, 2942 negatives) (TSV, 521 KB)
train-data\AL-Wiki.tsv : 3000 (261 positives, 2739 negatives) (TSV, 440 KB)
test-data\10-Ks (2018) unlabeled.tsv : 1266245 (unlabeled)
test-data\10-Ks (2018, test).tsv : 300 (67 positives, 233 negatives) (TSV, 71 KB)
test-data\Claims (test).tsv : 1000 (500 positives, 500 negatives) (TSV, 225 KB)
test-data\Wiki-Doc-Test.tsv : 3826 (1913 positives, 1913 negatives) (TSV, 613 KB)
test-data\Wikipedia (test).tsv : 300 (33 positives, 267 negatives) (TSV, 50 KB)
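As a quick sanity check, the labeled counts in the listing above can be cross-checked and turned into class-balance figures. This is only a sketch: the counts are copied from the listing, and the names LISTED_COUNTS and positive_share are hypothetical.

```python
# Counts copied from the file listing above: (total, positives, negatives).
LISTED_COUNTS = {
    "train-data\\Wiki-Doc-Train.tsv": (115847, 57922, 57925),
    "train-data\\AL-10Ks.tsv": (3000, 58, 2942),
    "train-data\\AL-Wiki.tsv": (3000, 261, 2739),
    "test-data\\10-Ks (2018, test).tsv": (300, 67, 233),
    "test-data\\Claims (test).tsv": (1000, 500, 500),
    "test-data\\Wiki-Doc-Test.tsv": (3826, 1913, 1913),
    "test-data\\Wikipedia (test).tsv": (300, 33, 267),
}


def positive_share(counts):
    """Return the fraction of positive sentences per file."""
    return {name: pos / total for name, (total, pos, _neg) in counts.items()}


# Each total should equal positives plus negatives.
consistent = all(pos + neg == total for total, pos, neg in LISTED_COUNTS.values())
shares = positive_share(LISTED_COUNTS)

print("totals consistent:", consistent)
for name, share in shares.items():
    print(f"{name}: {share:.1%} positive")
```

The shares make the class imbalance visible: the Wiki-Doc and Claims files are balanced, while the AL and 10-K/Wikipedia test files contain far fewer positives than negatives.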
Document-labeled sentences: train-data\Wiki-Doc-Train.tsv, dev-data\Wiki-Doc-Dev.tsv, test-data\Wiki-Doc-Test.tsv
Unlabeled sentences: train-data\10-Ks (2014) unlabeled.tsv, test-data\10-Ks (2018) unlabeled.tsv
Human-labeled sentences: train-data\AL-10Ks.tsv, train-data\AL-Wiki.tsv, dev-data\Wikipedia (dev).tsv, test-data\Claims (test).tsv, test-data\Wiki-Doc-Test.tsv, test-data\Wikipedia (test).tsv
Human-labeled sentences, other than the Claims, come from the bigger document-labeled or unlabeled data sets and can be mapped back to them through the "id" column.
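The mapping through "id" can be sketched as a dictionary lookup. This is an illustrative example rather than the authors' procedure: the function link_to_source is hypothetical, the rows are made up, and it assumes that ids are unique within a source file and shared verbatim between files.

```python
def link_to_source(human_rows, source_rows):
    """Pair each human-labeled row with the source row that has the same id.

    Returns (linked, unmatched): linked is a list of (human_row, source_row)
    pairs, unmatched is a list of human rows whose id is not in the source.
    """
    source_by_id = {row["id"]: row for row in source_rows}
    linked, unmatched = [], []
    for row in human_rows:
        source = source_by_id.get(row["id"])
        if source is None:
            unmatched.append(row)
        else:
            linked.append((row, source))
    return linked, unmatched


# Made-up rows, for illustration only (not taken from the dataset).
source_rows = [
    {"id": "s1", "label": -1, "sentence": "Revenue grew last quarter."},
    {"id": "s2", "label": -1, "sentence": "Emissions rules may raise costs."},
]
human_rows = [
    {"id": "s2", "label": 1, "sentence": "Emissions rules may raise costs."},
    {"id": "c9", "label": 1, "sentence": "A claim with no source row."},
]

linked, unmatched = link_to_source(human_rows, source_rows)
print(len(linked), "linked,", len(unmatched), "unmatched")
```

Rows without a match are returned separately rather than dropped, which suits files such as the Claims that are not drawn from the larger sets.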
For more information on the data, please read the paper "ClimaText: A Dataset for Climate Change Topic Detection".