This repository distributes the Norwegian Review Corpus (NoReC), created for the purpose of training and evaluating models for document-level sentiment analysis. More than 43,000 full-text reviews have been collected from major Norwegian news sources and cover a range of different domains, including literature, movies, video games, restaurants, music and theater, in addition to product reviews across a range of categories. Each review is labeled with a manually assigned score of 1–6, as provided by the rating of the original author. The accompanying paper by Velldal et al. at LREC 2018 describes the (initial release of the) data in more detail.
NoReC was created as part of the SANT project (Sentiment Analysis for Norwegian Text), a collaboration between the Language Technology Group (LTG) at the Department of Informatics at the University of Oslo, the Norwegian Broadcasting Corporation (NRK), Schibsted Media Group and Aller Media. This second release, v.2.1 of the corpus, comprises 43,437 review texts extracted from eight different news sources: Dagbladet, VG, Aftenposten, Bergens Tidende, Fædrelandsvennen, Stavanger Aftenblad, DinSide.no and P3.no. In terms of publishing date, the reviews mainly cover the time span 2003–2019, although the corpus also includes a handful of reviews dating back as far as 1998.
The data is distributed under a Creative Commons Attribution-NonCommercial licence (CC BY-NC 4.0). The full licence text is available here: https://creativecommons.org/licenses/by-nc/4.0/
The licence is motivated by the need to prevent third parties from redistributing the original reviews for commercial purposes. Note that machine-learned models, extracted lexicons, embeddings, and similar resources created on the basis of NoReC are not considered to contain the original data, and so can be freely used for commercial purposes as well, despite the non-commercial condition.
The reviews are distributed as .txt files, split into train, dev, and test sets. The files contain sentence and paragraph segmented texts, processed using UDPipe.
Metadata for each review is provided as a JSON object. All objects are listed in a single file, metadata.json, indexed on the document id. The JSON objects record properties like the numerical rating (an integer in the range 1–6), the thematic category or domain, the URL of the original document, and so on. They also record which of the two official varieties of Norwegian is used, as detected using langid.py.
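A minimal sketch of how the metadata file could be read in Python. The field name "rating" and the use of the bare file stem (e.g. "000042") as the document id key are assumptions made for illustration; check the actual keys in metadata.json before relying on them.

```python
import json


def load_metadata(path):
    """Read metadata.json and return a dict keyed by document id."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def get_field(metadata, doc_id, field, default=None):
    """Look up one property of one review.

    Returns `default` if the document id or the field is missing.
    The field names themselves are assumptions, not documented here.
    """
    return metadata.get(doc_id, {}).get(field, default)
```

With the corpus cloned, something like `get_field(load_metadata("data/metadata.json"), "000042", "rating")` would then return the rating of that review, provided the assumed key names match the file.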
Each review is stored as a separate file, with the filename given by the review ID. To facilitate replicability of experiments, the corpus comes with pre-defined standard splits for training, development and testing, with an 80–10–10 ratio. The data directory of the distribution is structured as follows, where the train/dev/test directories hold the individual files (e.g. 000042.txt):
data
├── metadata.json
├── train
├── dev
├── test
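Given the layout above, the text of a review can be located by trying each split directory in turn. The following is a sketch only: the helper names `find_review` and `read_review` are hypothetical, and UTF-8 encoding of the .txt files is an assumption.

```python
from pathlib import Path

# Split directories as shown in the directory tree.
SPLITS = ("train", "dev", "test")


def find_review(data_dir, doc_id):
    """Return (split, path) for a review id, or None if no split holds it."""
    for split in SPLITS:
        path = Path(data_dir) / split / f"{doc_id}.txt"
        if path.is_file():
            return split, path
    return None


def read_review(data_dir, doc_id):
    """Return the review text; raise FileNotFoundError if the id is unknown."""
    found = find_review(data_dir, doc_id)
    if found is None:
        raise FileNotFoundError(f"no review {doc_id} under {data_dir}")
    _, path = found
    # UTF-8 is assumed here; the encoding is not stated in this README.
    return path.read_text(encoding="utf-8")
```

For example, `read_review("data", "000042")` would return the contents of 000042.txt from whichever split contains it.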
git clone https://github.com/ltgoslo/norec
If you publish work that uses or references the data, please cite our LREC article. BibEntry:
@InProceedings{VelOvrBer18,
author = {Erik Velldal and Lilja {\O}vrelid and
Eivind Alexander Bergem and Cathrine Stadsnes and
Samia Touileb and Fredrik J{\o}rgensen},
title = {{NoReC}: The {N}orwegian {R}eview {C}orpus},
booktitle = {Proceedings of the 11th edition of the
Language Resources and Evaluation Conference},
year = {2018},
address = {Miyazaki, Japan},
pages = {4186--4191}
}
All splits combined
year | ap | bt | db | dinside | fvn | p3 | sa | vg | Total |
---|---|---|---|---|---|---|---|---|---|
2003* | 0 | 4 | 0 | 143 | 0 | 25 | 0 | 286 | 458 |
2004 | 0 | 44 | 0 | 142 | 0 | 12 | 19 | 984 | 1201 |
2005 | 0 | 0 | 0 | 179 | 0 | 6 | 224 | 909 | 1318 |
2006 | 0 | 0 | 0 | 240 | 0 | 11 | 294 | 778 | 1323 |
2007 | 0 | 0 | 0 | 139 | 0 | 127 | 400 | 725 | 1391 |
2008 | 0 | 0 | 0 | 119 | 0 | 216 | 369 | 739 | 1443 |
2009 | 0 | 52 | 377 | 163 | 27 | 428 | 259 | 815 | 2121 |
2010 | 0 | 100 | 642 | 260 | 156 | 571 | 309 | 769 | 2807 |
2011 | 1 | 51 | 592 | 284 | 146 | 652 | 362 | 900 | 2988 |
2012 | 2 | 150 | 613 | 257 | 332 | 611 | 561 | 763 | 3289 |
2013 | 4 | 160 | 527 | 216 | 213 | 619 | 433 | 1058 | 3230 |
2014 | 39 | 291 | 501 | 236 | 357 | 546 | 387 | 1191 | 3548 |
2015 | 249 | 235 | 728 | 245 | 456 | 499 | 620 | 849 | 3881 |
2016 | 309 | 340 | 809 | 177 | 321 | 439 | 682 | 715 | 3792 |
2017 | 649 | 491 | 921 | 248 | 692 | 567 | 822 | 687 | 5077 |
2018 | 605 | 470 | 885 | 194 | 466 | 339 | 860 | 492 | 4311 |
2019 | 260 | 167 | 95 | 30 | 160 | 36 | 346 | 165 | 1259 |
2003*: Including the 31 documents from 1998–2002
split | 1 | 2 | 3 | 4 | 5 | 6 | Total |
---|---|---|---|---|---|---|---|
dev | 51 | 225 | 707 | 1409 | 1678 | 278 | 4348 |
test | 27 | 242 | 706 | 1385 | 1714 | 266 | 4340 |
train | 379 | 2287 | 6004 | 11304 | 12614 | 2161 | 34749 |
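As a quick consistency check on the per-rating table above, the row sums and the 80–10–10 split ratio can be recomputed from the listed counts. The variable names below are illustrative only; the numbers are copied from the table.

```python
# Review counts per rating (1 to 6) for each split, copied from the table.
rating_counts = {
    "train": [379, 2287, 6004, 11304, 12614, 2161],
    "dev": [51, 225, 707, 1409, 1678, 278],
    "test": [27, 242, 706, 1385, 1714, 266],
}

# Size of each split is the sum over its six rating columns.
split_sizes = {split: sum(counts) for split, counts in rating_counts.items()}
total = sum(split_sizes.values())

# Share of the corpus held by each split, in percent (one decimal).
split_share = {
    split: round(100 * size / total, 1) for split, size in split_sizes.items()
}

# Share of each rating across all splits, which shows the skew towards 4 and 5.
rating_totals = [sum(col) for col in zip(*rating_counts.values())]
rating_share = [round(100 * n / total, 1) for n in rating_totals]
```

The split sizes reproduce the Total column (34749, 4348 and 4340), and the shares come out at 80.0, 10.0 and 10.0 percent.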
split | games | literature | misc | music | products | restaurants | screen | sports | stage | Total |
---|---|---|---|---|---|---|---|---|---|---|
dev | 179 | 539 | 28 | 1445 | 347 | 94 | 1569 | 15 | 132 | 4348 |
test | 180 | 547 | 24 | 1444 | 345 | 98 | 1579 | 16 | 107 | 4340 |
train | 1453 | 4337 | 156 | 11777 | 2771 | 745 | 12536 | 118 | 856 | 34749 |
Version 2.1 November 2023:
We have cleaned NoReC, introducing the following changes:
There were previously 4619 texts in the "misc" category. We have assigned the correct category to most of these, based on the source categories, source tags and manual inspection. The remaining 208 texts labeled "misc" should now be truly miscellaneous, like reviews of podcasts, art exhibitions and politicians taking part in debates.
We consider the "category" tag to be the best representation of domain for the reviewed entity or event.
177 reviews were found to be duplicates: cross-postings in more than one news outlet within the same media group. Removing them reduced the total count of reviews from 43614 to 43437.