Multi-LexSum is a multi-document summarization dataset for civil rights litigation, with summaries at three granularities.
Update: Multi-LexSum is now on HuggingFace Datasets Hub! Check allenai/multi_lexsum.
from datasets import load_dataset
# please install HuggingFace datasets by pip install datasets
multi_lexsum = load_dataset("allenai/multi_lexsum", name="v20220616")
# Download multi_lexsum locally and load it as a Dataset object
example = multi_lexsum["validation"][0] # The first instance of the dev set
example["sources"] # A list of source document text for the case
for sum_len in ["long", "short", "tiny"]:
    print(example["summary/" + sum_len]) # Summaries of three lengths
You can start using the dataset via the provided example.ipynb.
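As a minimal sketch of how the fields shown above might be turned into (input, target) pairs for a summarization task, consider the helper below. The function name `make_pair`, the `"\n\n"` separator used to join source documents, and the toy record are illustrative assumptions, not part of the dataset's API. It also assumes that a summary field can be missing (`None`) for some cases and returns `None` for such pairs.

```python
from typing import Optional, Tuple


def make_pair(example: dict, source: str, target: str,
              sep: str = "\n\n") -> Optional[Tuple[str, str]]:
    """Build an (input, target) text pair from one Multi-LexSum record.

    source: "sources" for the raw documents, or "long"/"short" for a summary.
    target: "long", "short", or "tiny".
    """
    if source == "sources":
        # Join the list of source documents into one input string
        # (the separator is an assumption, not a dataset convention).
        text = sep.join(example["sources"])
    else:
        text = example.get("summary/" + source)
    summary = example.get("summary/" + target)
    if not text or not summary:
        return None  # skip records where either side is missing
    return text, summary


# Toy record using the same field names as the snippet above (made-up content).
toy = {
    "sources": ["Complaint text.", "Court order text."],
    "summary/long": "A long summary.",
    "summary/short": "A short summary.",
    "summary/tiny": None,
}

print(make_pair(toy, "sources", "long"))
print(make_pair(toy, "long", "tiny"))
```

With a real split, the same helper could be applied to each record, dropping the `None` results before training or evaluation.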
We upload the following trained models to the Hugging Face Model Hub for easy access and reproducibility:
Notes:
- In the table above, we use D for source documents, L for long summary, S for short summary, and T for tiny summary.
- For Multi-Task Summarizers, you can prepend one of the following prompts to the input to tell the model which summary length to generate:
  - "summarize:long " for generating long summaries
  - "summarize:short " for generating short summaries
  - "summarize:tiny " for generating tiny summaries
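The prompt convention above can be sketched as a small helper that prepends the right prefix to an input text. The names `PROMPTS` and `add_length_prompt` are hypothetical and introduced only for illustration; the prompt strings themselves, including the trailing space, are taken from the notes above.

```python
# Prompt prefixes for the multi-task summarizers (note the trailing space).
PROMPTS = {
    "long": "summarize:long ",
    "short": "summarize:short ",
    "tiny": "summarize:tiny ",
}


def add_length_prompt(text: str, length: str) -> str:
    """Prepend the length-control prompt to the model input text."""
    if length not in PROMPTS:
        raise ValueError(f"length must be one of {sorted(PROMPTS)}, got {length!r}")
    return PROMPTS[length] + text


print(add_length_prompt("The plaintiffs filed a class action.", "tiny"))
```

The resulting string would then be tokenized and passed to one of the multi-task checkpoints in the usual way; that step is omitted here because it depends on the model and tokenizer you load.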
We also list the pre-trained weights used to initialize each model below:
Name | Pretrained Weights |
---|---|
Summarizing Source Documents (Long Models) | |
allenai/led-base-16384-multi_lexsum-source-long | allenai/led-base-16384 |
allenai/led-base-16384-multi_lexsum-source-short | allenai/led-base-16384 |
allenai/led-base-16384-multi_lexsum-source-tiny | allenai/led-base-16384 |
allenai/primera-multi_lexsum-source-long | allenai/PRIMERA |
allenai/primera-multi_lexsum-source-short | allenai/PRIMERA |
allenai/primera-multi_lexsum-source-tiny | allenai/PRIMERA |
Summarizing Summaries | |
allenai/bart-large-multi_lexsum-long-short | facebook/bart-large-xsum |
allenai/bart-large-multi_lexsum-long-tiny | facebook/bart-large-xsum |
allenai/bart-large-multi_lexsum-short-tiny | facebook/bart-large-xsum |
allenai/pegasus-multi_lexsum-long-short | google/pegasus-xsum |
allenai/pegasus-multi_lexsum-long-tiny | google/pegasus-xsum |
allenai/pegasus-multi_lexsum-short-tiny | google/pegasus-xsum |
Multi-Task Summarizers | |
allenai/bart-large-multi_lexsum-source-multitask | facebook/bart-large-xsum |
allenai/bart-large-multi_lexsum-long-multitask | facebook/bart-large-xsum |
The Multi-LexSum dataset is distributed under the Open Data Commons Attribution License (ODC-By). The case summaries and metadata are licensed under the Creative Commons Attribution-NonCommercial License (CC BY-NC), and the source documents are already in the public domain. Commercial users who want a license for the summaries and metadata can contact info@clearinghouse.net; such a license will allow free use but limit re-posting of the summaries. The corresponding code for downloading and loading the dataset is licensed under the Apache License 2.0.
@article{Shen2022MultiLexSum,
author = {Zejiang Shen and
Kyle Lo and
Lauren Yu and
Nathan Dahlberg and
Margo Schlanger and
Doug Downey},
title = {Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities},
journal = {CoRR},
volume = {abs/2206.10883},
year = {2022},
url = {https://doi.org/10.48550/arXiv.2206.10883},
doi = {10.48550/arXiv.2206.10883}
}