Adding data and instructions to recreate booksum experiments

salesforce · May 18, 2021 · commit 0676ff6

Showing 34 changed files with 26,199 additions and 0 deletions.
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -0,0 +1,105 @@
+Salesforce Open Source Community Code of Conduct
+
+## About the Code of Conduct
+
+Equality is a core value at Salesforce. We believe a diverse and inclusive
+community fosters innovation and creativity, and are committed to building a
+culture where everyone feels included.
+
+Salesforce open-source projects are committed to providing a friendly, safe, and
+welcoming environment for all, regardless of gender identity and expression,
+sexual orientation, disability, physical appearance, body size, ethnicity, nationality, 
+race, age, religion, level of experience, education, socioeconomic status, or 
+other similar personal characteristics.
+
+The goal of this code of conduct is to specify a baseline standard of behavior so
+that people with different social values and communication styles can work
+together effectively, productively, and respectfully in our open source community.
+It also establishes a mechanism for reporting issues and resolving conflicts.
+
+All questions and reports of abusive, harassing, or otherwise unacceptable behavior
+in a Salesforce open-source project may be reported by contacting the Salesforce
+Open Source Conduct Committee at ossconduct@salesforce.com.
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to making participation in our project and
+our community a harassment-free experience for everyone, regardless of gender 
+identity and expression, sexual orientation, disability, physical appearance, 
+body size, ethnicity, nationality, race, age, religion, level of experience, education, 
+socioeconomic status, or other similar personal characteristics.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy toward other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+advances
+* Personal attacks, insulting/derogatory comments, or trolling
+* Public or private harassment
+* Publishing, or threatening to publish, others' private information—such as
+a physical or electronic address—without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+professional setting
+* Advocating for or encouraging any of the above behaviors
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned with this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies both within project spaces and in public spaces
+when an individual is representing the project or its community. Examples of
+representing a project or community include using an official project email
+address, posting via an official social media account, or acting as an appointed
+representative at an online or offline event. Representation of a project may be
+further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the Salesforce Open Source Conduct Committee
+at ossconduct@salesforce.com. All complaints will be reviewed and investigated
+and will result in a response that is deemed necessary and appropriate to the
+circumstances. The committee is obligated to maintain confidentiality with
+regard to the reporter of an incident. Further details of specific enforcement
+policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership and the Salesforce Open Source Conduct
+Committee.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][contributor-covenant-home],
+version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html.
+It includes adaptions and additions from [Go Community Code of Conduct][golang-coc],
+[CNCF Code of Conduct][cncf-coc], and [Microsoft Open Source Code of Conduct][microsoft-coc].
+
+This Code of Conduct is licensed under the [Creative Commons Attribution 3.0 License][cc-by-3-us].
+
+[contributor-covenant-home]: https://www.contributor-covenant.org (https://www.contributor-covenant.org/)
+[golang-coc]: https://golang.org/conduct
+[cncf-coc]: https://github.com/cncf/foundation/blob/master/code-of-conduct.md
+[microsoft-coc]: https://opensource.microsoft.com/codeofconduct/
+[cc-by-3-us]: https://creativecommons.org/licenses/by/3.0/us/
diff --git a/LICENSE.txt b/LICENSE.txt
@@ -0,0 +1,12 @@
+Copyright (c) 2021, Salesforce.com, Inc.
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+* Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/README.md b/README.md
@@ -0,0 +1,125 @@
+# BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization
+Authors: [Wojciech Kryściński](https://twitter.com/iam_wkr), [Nazneen Rajani](https://twitter.com/nazneenrajani), [Divyansh Agarwal](https://twitter.com/jigsaw2212), [Caiming Xiong](https://twitter.com/caimingxiong), [Dragomir Radev](http://www.cs.yale.edu/homes/radev/)
+
+## Introduction
+The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. 
+While relevant, such datasets will offer limited challenges for future generations of text summarization systems.
+We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization.
+Our dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level.
+The domain and structure of our dataset pose a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures.
+To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.
+
+Paper link: https://arxiv.org/abs/XXXXXXX
+
+<p align="center"><img src="misc/book_sumv4.png"></p>
+
+## Table of Contents
+
+1. [Updates](#updates)
+2. [Citation](#citation)
+3. [Legal Note](#legal-note)
+4. [License](#license)
+5. [Usage](#usage)
+6. [Get Involved](#get-involved)
+
+## Updates
+#### 4/15/2021
+Initial commit
+
+
+## Citation
+```
+@article{XXXXXX:2021,
+  author    = {},
+  title     = {BOOKSum: A Collection of Datasets for Long-form Narrative Summarization},
+  journal   = {arXiv preprint arXiv:XXXXXXX},
+  year      = {2021},
+}
+```
+
+## Legal Note
+By downloading or using the resources, including any code or scripts, shared in this code
+repository, you hereby agree to the following terms, and your use of the resources is conditioned
+on and subject to these terms.
+1. You may only use the scripts shared in this code repository for research purposes. You
+may not use or allow others to use the scripts for any other purposes and other uses are
+expressly prohibited.
+2. You will comply with all terms and conditions, and are responsible for obtaining all
+rights, related to the services you access and the data you collect.
+3. We do not make any representations or warranties whatsoever regarding the sources from
+which data is collected. Furthermore, we are not liable for any damage, loss or expense of
+any kind arising from or relating to your use of the resources shared in this code
+repository or the data collected, regardless of whether such liability is based in tort,
+contract or otherwise.
+
+## License
+The code is released under the **BSD-3 License** (see `LICENSE.txt` for details).
+
+
+## Usage
+
+#### 1. Chapterized Project Gutenberg Data
+The chapterized book text from Gutenberg, for the books we use in our work, has been made available through a public GCP bucket. It can be fetched using:
+```
+gsutil cp gs://sfr-books-dataset-chapters-research/all_chapterized_books.zip .
+```
+
+or downloaded directly [here](https://storage.cloud.google.com/sfr-books-dataset-chapters-research/all_chapterized_books.zip).
+
+#### 2. Data Collection
+Data collection scripts for the summary text are organized by the different sources that we use summaries from.
+Note: At the time of collecting the data, all links in literature_links.tsv were working for the respective sources. 
+
+For each data source, run `get_works.py` to first fetch the links for each book, and then run `get_summaries.py` to get the summaries from the collected links.
+
+```
+python scripts/data_collection/cliffnotes/get_works.py
+python scripts/data_collection/cliffnotes/get_summaries.py
+```
+
+#### 3. Data Cleaning
+
+Data Cleaning is performed through the following steps:
+
+The first script performs basic cleaning operations, such as removing parentheses and links from the summary text:
+```
+python scripts/data_cleaning_scripts/basic_clean.py
+```
+
+The second script uses the intermediate alignments in `summary_chapter_matched_all_sources.jsonl` to identify which summaries are separable and splits them into new summaries (e.g., a summary for Chapters 1-3 is separated into three files: Chapter 1 summary, Chapter 2 summary, and Chapter 3 summary):
+```
+python scripts/data_cleaning_scripts/split_aggregate_chaps_all_sources.py
+```
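
The label side of this split can be illustrated with a minimal sketch. It is not the logic of `split_aggregate_chaps_all_sources.py`, which is not shown here and which relies on the intermediate alignments to separate the summary text itself. The function name `expand_chapter_range` and the assumed title format ("Chapters 1-3") are hypothetical.

```python
import re


def expand_chapter_range(title):
    """Expand an aggregate title such as "Chapters 1-3" into per-chapter titles.

    Hypothetical helper: the title format is an assumption for illustration.
    Titles that do not match the pattern are returned unchanged.
    """
    cleaned = title.strip()
    match = re.fullmatch(r"Chapters\s+(\d+)\s*-\s*(\d+)", cleaned)
    if match is None:
        return [cleaned]
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        # A reversed range is not expandable; leave the title as it is.
        return [cleaned]
    return [f"Chapter {number}" for number in range(first, last + 1)]
```

Each returned title would then name one of the separated summary files.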
+
+Lastly, the final cleaning script uses various regexes to separate out analysis/commentary text and to remove prefixes, suffixes, etc.:
+```
+python scripts/data_cleaning_scripts/clean_summaries.py
+```
+
+#### 4. Data Alignments
+Paragraph alignments are generated from the chapter-level-summary-alignments, individually for each of the train/test/val splits:
+
+Gather the data from the summaries and book chapters into a single jsonl
+```
+python paragraph-level-summary-alignments/gather_data.py
+```
+
+Generate alignments of the paragraphs with sentences from the summary using the bi-encoder **paraphrase-distilroberta-base-v1**
+```
+python paragraph-level-summary-alignments/align_data_bi_encoder_paraphrase.py
+```
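
The general bi-encoder idea behind this step can be sketched as follows: each summary sentence and each paragraph is embedded independently, and a sentence is assigned to the paragraph with the highest cosine similarity. This is an illustration only, not the contents of `align_data_bi_encoder_paraphrase.py`. The names `cosine` and `align_sentences` are hypothetical, and the `embed` argument stands in for the encoder (in the repository, paraphrase-distilroberta-base-v1); any function mapping a string to a list of floats works here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_sentences(summary_sentences, paragraphs, embed):
    """Match each summary sentence to its most similar paragraph.

    Assumes `paragraphs` is non-empty. Returns a list of
    (sentence, paragraph_index, similarity) tuples.
    """
    # Paragraphs are embedded once and reused for every sentence.
    paragraph_vectors = [embed(paragraph) for paragraph in paragraphs]
    alignments = []
    for sentence in summary_sentences:
        vector = embed(sentence)
        scores = [cosine(vector, pv) for pv in paragraph_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        alignments.append((sentence, best, scores[best]))
    return alignments
```

Because sentences and paragraphs are encoded separately, paragraph embeddings only need to be computed once per chapter.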
+
+Aggregate the generated alignments for cases where multiple sentences from chapter-summaries are matched to the same paragraph from the book
+```
+python paragraph-level-summary-alignments/aggregate_paragraph_alignments_bi_encoder_paraphrase.py
+```
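
One plausible shape for this aggregation is sketched below: all summary sentences matched to the same paragraph are grouped and joined into a single summary for that paragraph. The function name `aggregate_alignments` and the `(paragraph_id, sentence)` input format are assumptions for illustration, not the format used by the repository script.

```python
from collections import defaultdict


def aggregate_alignments(pairs):
    """Group (paragraph_id, summary_sentence) pairs by paragraph.

    Sentences matched to the same paragraph are joined, in input order,
    into one summary string per paragraph.
    """
    grouped = defaultdict(list)
    for paragraph_id, sentence in pairs:
        grouped[paragraph_id].append(sentence)
    return {pid: " ".join(sentences) for pid, sentences in grouped.items()}
```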
+
+## Troubleshooting
+1. The web archive links we collect the summaries from can be unreliable and often take a long time to load. One mitigation, implemented in some of the scripts, is to use a longer sleep timeout when one of the links throws an exception.
+2. Links that consistently throw errors are aggregated in a file called `section_errors.txt`. This file is useful for inspecting which links are actually unavailable and for re-running the data collection scripts on those specific links.
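
The retry approach from the first item can be sketched as below. This is a minimal illustration, not the code used in the collection scripts: `fetch_with_retries`, its parameters, and the doubling schedule are assumptions, and `fetch` stands for whatever callable downloads a page.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=4, base_sleep=1.0, sleep=time.sleep):
    """Call fetch(url), sleeping longer after each failed attempt.

    Hypothetical helper: the delay doubles after every failure
    (base_sleep, 2 * base_sleep, ...). The last error is re-raised
    once max_attempts is exhausted.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as error:  # the scripts' real exception types are not shown
            last_error = error
            if attempt < max_attempts - 1:
                sleep(base_sleep * 2 ** attempt)
    raise last_error
```

A caller could catch the final exception and append the failing URL to an error file, in the spirit of the second item.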
+
+
+## Get Involved
+Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports.
+We welcome PRs!
+
diff --git a/SECURITY.md b/SECURITY.md
@@ -0,0 +1,7 @@
+Security
+
+Please report any security issue to [security@salesforce.com](mailto:security@salesforce.com)
+as soon as it is discovered. This library limits its runtime dependencies in
+order to reduce the total cost of ownership as much as can be, but all consumers
+should remain vigilant and have their security stakeholders review all third-party
+Products (3PP) like this one and their dependencies.