Adding data and instructions to recreate booksum experiments
jigsaw2212 committed May 18, 2021
0 parents commit 0676ff6
Showing 34 changed files with 26,199 additions and 0 deletions.
105 changes: 105 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,105 @@
# Salesforce Open Source Community Code of Conduct

## About the Code of Conduct

Equality is a core value at Salesforce. We believe a diverse and inclusive
community fosters innovation and creativity, and are committed to building a
culture where everyone feels included.

Salesforce open-source projects are committed to providing a friendly, safe, and
welcoming environment for all, regardless of gender identity and expression,
sexual orientation, disability, physical appearance, body size, ethnicity, nationality,
race, age, religion, level of experience, education, socioeconomic status, or
other similar personal characteristics.

The goal of this code of conduct is to specify a baseline standard of behavior so
that people with different social values and communication styles can work
together effectively, productively, and respectfully in our open source community.
It also establishes a mechanism for reporting issues and resolving conflicts.

All questions and reports of abusive, harassing, or otherwise unacceptable behavior
in a Salesforce open-source project may be reported by contacting the Salesforce
Open Source Conduct Committee at ossconduct@salesforce.com.

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of gender
identity and expression, sexual orientation, disability, physical appearance,
body size, ethnicity, nationality, race, age, religion, level of experience, education,
socioeconomic status, or other similar personal characteristics.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy toward other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Personal attacks, insulting/derogatory comments, or trolling
* Public or private harassment
* Publishing, or threatening to publish, others' private information, such as
a physical or electronic address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
* Advocating for or encouraging any of the above behaviors

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned with this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project email
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the Salesforce Open Source Conduct Committee
at ossconduct@salesforce.com. All complaints will be reviewed and investigated
and will result in a response that is deemed necessary and appropriate to the
circumstances. The committee is obligated to maintain confidentiality with
regard to the reporter of an incident. Further details of specific enforcement
policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership and the Salesforce Open Source Conduct
Committee.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][contributor-covenant-home],
version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html.
It includes adaptions and additions from [Go Community Code of Conduct][golang-coc],
[CNCF Code of Conduct][cncf-coc], and [Microsoft Open Source Code of Conduct][microsoft-coc].

This Code of Conduct is licensed under the [Creative Commons Attribution 3.0 License][cc-by-3-us].

[contributor-covenant-home]: https://www.contributor-covenant.org (https://www.contributor-covenant.org/)
[golang-coc]: https://golang.org/conduct
[cncf-coc]: https://github.com/cncf/foundation/blob/master/code-of-conduct.md
[microsoft-coc]: https://opensource.microsoft.com/codeofconduct/
[cc-by-3-us]: https://creativecommons.org/licenses/by/3.0/us/
12 changes: 12 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,12 @@
Copyright (c) 2021, Salesforce.com, Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

* Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
125 changes: 125 additions & 0 deletions README.md
@@ -0,0 +1,125 @@
# BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization
Authors: [Wojciech Kryściński](https://twitter.com/iam_wkr), [Nazneen Rajani](https://twitter.com/nazneenrajani), [Divyansh Agarwal](https://twitter.com/jigsaw2212), [Caiming Xiong](https://twitter.com/caimingxiong), [Dragomir Radev](http://www.cs.yale.edu/homes/radev/)

## Introduction
The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases.
While relevant, such datasets will offer limited challenges for future generations of text summarization systems.
We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization.
Our dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level.
The domain and structure of our dataset pose a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures.
To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.

Paper link: https://arxiv.org/abs/XXXXXXX

<p align="center"><img src="misc/book_sumv4.png"></p>

## Table of Contents

1. [Updates](#updates)
2. [Citation](#citation)
3. [Legal Note](#legal-note)
4. [License](#license)
5. [Usage](#usage)
6. [Troubleshooting](#troubleshooting)
7. [Get Involved](#get-involved)

## Updates
#### 4/15/2021
Initial commit


## Citation
```
@article{XXXXXX:2021,
author = {},
title = {BOOKSum: A Collection of Datasets for Long-form Narrative Summarization},
journal = {arXiv preprint arXiv:XXXXXXX},
year = {2021},
}
```

## Legal Note
By downloading or using the resources, including any code or scripts, shared in this code
repository, you hereby agree to the following terms, and your use of the resources is conditioned
on and subject to these terms.
1. You may only use the scripts shared in this code repository for research purposes. You
may not use or allow others to use the scripts for any other purposes and other uses are
expressly prohibited.
2. You will comply with all terms and conditions, and are responsible for obtaining all
rights, related to the services you access and the data you collect.
3. We do not make any representations or warranties whatsoever regarding the sources from
which data is collected. Furthermore, we are not liable for any damage, loss or expense of
any kind arising from or relating to your use of the resources shared in this code
repository or the data collected, regardless of whether such liability is based in tort,
contract or otherwise.

## License
The code is released under the **BSD-3 License** (see `LICENSE.txt` for details).


## Usage

#### 1. Chapterized Project Gutenberg Data
The chapterized book text from Gutenberg, for the books we use in our work, has been made available through a public GCP bucket. It can be fetched using:
```
gsutil cp gs://sfr-books-dataset-chapters-research/all_chapterized_books.zip .
```

or downloaded directly [here](https://storage.cloud.google.com/sfr-books-dataset-chapters-research/all_chapterized_books.zip).
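
Once the archive has been downloaded, it can also be unpacked from Python. The following is a minimal sketch using only the standard library; the function name `extract_archive` is hypothetical and nothing is assumed about the internal layout of the archive:

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, out_dir):
    """Unpack a zip archive and return the sorted paths of the extracted files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    # Directories are skipped; only regular files are returned.
    return sorted(p for p in out_dir.rglob("*") if p.is_file())
```

For example, `extract_archive("all_chapterized_books.zip", "all_chapterized_books")` would unpack the downloaded file into a local directory of the same name.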

#### 2. Data Collection
Data collection scripts for the summary text are organized by the different sources that we use summaries from.
Note: At the time of collecting the data, all links in `literature_links.tsv` were working for the respective sources.

For each data source, run `get_works.py` to first fetch the links for each book, and then run `get_summaries.py` to get the summaries from the collected links.

```
python scripts/data_collection/cliffnotes/get_works.py
python scripts/data_collection/cliffnotes/get_summaries.py
```

#### 3. Data Cleaning

Data Cleaning is performed through the following steps:

First, run the basic cleaning script, which removes parentheses, links, etc. from the summary text
```
python scripts/data_cleaning_scripts/basic_clean.py
```
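
To give a rough idea of what this kind of cleaning involves, here is a small sketch. The patterns and the function name `basic_clean` are illustrative assumptions, not the rules implemented in `basic_clean.py`:

```python
import re

# Hypothetical patterns for illustration; the repository's script may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PAREN_RE = re.compile(r"\([^()]*\)")
SPACE_RE = re.compile(r"\s+")


def basic_clean(text):
    """Remove links and parenthesized asides, then normalize whitespace."""
    text = URL_RE.sub("", text)
    text = PAREN_RE.sub("", text)
    # Drop the space left behind before punctuation, e.g. "word ." -> "word."
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return SPACE_RE.sub(" ", text).strip()
```

Note that the URL pattern in this sketch also swallows punctuation directly attached to a link, so real summary text may need more careful handling.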

Next, we use the intermediate alignments in `summary_chapter_matched_all_sources.jsonl` to identify which aggregate summaries are separable, and split them into new summaries (e.g., a "Chapters 1-3" summary is separated into three files: Chapter 1 summary, Chapter 2 summary, and Chapter 3 summary)
```
python scripts/data_cleaning_scripts/split_aggregate_chaps_all_sources.py
```
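
As a small illustration of the first half of this step, the sketch below expands an aggregate title such as "Chapters 1-3" into individual chapter numbers. The title format and the helper name `expand_chapter_range` are assumptions; the actual script relies on the alignment file to decide how the summary text itself is divided, which is not shown here:

```python
import re

# Matches the first "start-end" number range in a title (assumed format).
RANGE_RE = re.compile(r"(\d+)\s*-\s*(\d+)")


def expand_chapter_range(title):
    """Return the chapter numbers covered by a title such as 'Chapters 1-3'."""
    match = RANGE_RE.search(title)
    if match is None:
        # Not an aggregate title: nothing to expand.
        return []
    start, end = int(match.group(1)), int(match.group(2))
    return list(range(start, end + 1))
```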

Lastly, run the final cleaning script, which uses various regexes to separate out analysis/commentary text and to remove prefixes, suffixes, etc.
```
python scripts/data_cleaning_scripts/clean_summaries.py
```
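
The sketch below shows one way regex-based separation of summary and commentary could look. The marker words, the `Summary:` prefix, and the function name `split_summary` are hypothetical; `clean_summaries.py` uses its own source-specific patterns:

```python
import re

# Hypothetical patterns; the actual script uses its own, source-specific regexes.
ANALYSIS_RE = re.compile(
    r"^\s*(Analysis|Commentary)\b\s*:?", re.IGNORECASE | re.MULTILINE
)
PREFIX_RE = re.compile(r"^\s*Summary\s*:?\s*", re.IGNORECASE)


def split_summary(text):
    """Return (summary, analysis); analysis is '' when no marker line is found."""
    match = ANALYSIS_RE.search(text)
    if match is None:
        summary, analysis = text, ""
    else:
        # Everything before the marker is summary, everything after is analysis.
        summary, analysis = text[: match.start()], text[match.end():]
    summary = PREFIX_RE.sub("", summary, count=1)
    return summary.strip(), analysis.strip()
```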

#### 4. Data Alignments
Paragraph alignments are generated from the chapter-level summary alignments, separately for each of the train/test/val splits:

Gather the data from the summaries and book chapters into a single jsonl
```
python paragraph-level-summary-alignments/gather_data.py
```
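
For readers unfamiliar with the JSON Lines format, here is a minimal sketch of writing and reading such a file. The helper names are hypothetical, and the record fields used by `gather_data.py` are not specified here:

```python
import json


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        # Blank lines are skipped so a trailing newline does not cause errors.
        return [json.loads(line) for line in handle if line.strip()]
```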

Generate alignments of the paragraphs with sentences from the summary using the bi-encoder **paraphrase-distilroberta-base-v1**
```
python paragraph-level-summary-alignments/align_data_bi_encoder_paraphrase.py
```
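
Conceptually, a bi-encoder embeds summary sentences and book paragraphs independently, and the alignment is then derived from similarities between the two sets of vectors. The sketch below assumes a simple rule (each sentence is matched to its most similar paragraph by cosine similarity), which may differ from what the repository script does. The embeddings are passed in as plain lists so that the example runs without extra dependencies:

```python
import math

# In practice the embeddings would come from a bi-encoder, for example
# SentenceTransformer("paraphrase-distilroberta-base-v1").encode(texts)
# from the sentence-transformers library. Toy vectors work just as well here.


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_sentences(sentence_embs, paragraph_embs):
    """For each summary sentence, return (best paragraph index, similarity)."""
    alignments = []
    for s_emb in sentence_embs:
        scores = [cosine(s_emb, p_emb) for p_emb in paragraph_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        alignments.append((best, scores[best]))
    return alignments
```

This greedy best-match rule is an assumption made for illustration. It requires at least one paragraph embedding and does not prevent several sentences from mapping to the same paragraph.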

Aggregate the generated alignments for cases where multiple sentences from chapter-summaries are matched to the same paragraph from the book
```
python paragraph-level-summary-alignments/aggregate_paragraph_alignments_bi_encoder_paraphrase.py
```
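
A minimal sketch of this aggregation step, assuming the alignments are available as a list of paragraph indices (one per summary sentence). The function name and the choice to join sentences with a space are illustrative, not taken from the repository script:

```python
from collections import defaultdict


def aggregate_by_paragraph(alignments, sentences):
    """Group summary sentences by the paragraph index they were aligned to.

    `alignments[i]` is the paragraph index matched to `sentences[i]`.
    """
    grouped = defaultdict(list)
    for sentence, paragraph_idx in zip(sentences, alignments):
        grouped[paragraph_idx].append(sentence)
    # Sentences are joined in their original summary order.
    return {idx: " ".join(parts) for idx, parts in grouped.items()}
```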

## Troubleshooting
1. The web archive links we collect the summaries from can be unreliable and may take a long time to load. One way to mitigate this is to use longer sleep timeouts when a link throws an exception, which has been implemented in some of the scripts.
2. Links that consistently throw errors are aggregated in a file called `section_errors.txt`. This file is useful for inspecting which links are actually unavailable and for re-running the data collection scripts on those specific links.
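
The retry idea from the first point can be sketched as follows. This is not the code used in the repository scripts: the function name, the default attempt count, and the linearly growing wait are all assumptions, and `fetch` stands for whatever callable downloads a page:

```python
import time


def fetch_with_retries(fetch, url, max_attempts=4, base_sleep=5.0, sleep=time.sleep):
    """Call fetch(url), sleeping longer after each failure.

    Returns the result, or None if every attempt raised an exception.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            if attempt == max_attempts:
                # Give up; the caller can record the URL for later inspection.
                return None
            wait = base_sleep * attempt  # 5s, 10s, 15s, ... with the defaults
            print(f"Attempt {attempt} failed for {url} ({error}); sleeping {wait:.0f}s")
            sleep(wait)
```

URLs for which this returns `None` could then be appended to an error file and retried in a later run.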


## Get Involved
Please create a GitHub issue if you have any questions, suggestions, requests or bug reports.
We welcome PRs!

7 changes: 7 additions & 0 deletions SECURITY.md
@@ -0,0 +1,7 @@
## Security

Please report any security issue to [security@salesforce.com](mailto:security@salesforce.com)
as soon as it is discovered. This library limits its runtime dependencies in
order to reduce the total cost of ownership as much as possible, but all consumers
should remain vigilant and have their security stakeholders review all third-party
products (3PP) like this one and their dependencies.