CiteSum

This repo provides the dataset, model checkpoints, and code for the paper "CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation".

TLDR: By pretraining on (automatically extracted) citation sentences in scientific papers, we achieve SOTA on SciTLDR, XSum, and Gigaword in zero-shot and/or few-shot settings.

How to run (Huggingface)

CiteSum is available on the Hugging Face Hub. You can load it as follows (credit @nbroad1881):

from datasets import load_dataset

ds = load_dataset("yuningm/citesum")
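
As a quick sanity check after loading, it can help to see how many examples each split contains. The following is a minimal sketch, not part of this repo: `split_sizes` and `load_and_describe` are hypothetical helper names, and the sketch assumes that `load_dataset` returns a dict-like object mapping split names to datasets (which is what it returns when no `split` argument is given).

```python
from typing import Dict, Mapping, Sized


def split_sizes(dataset_dict: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of examples in each split (hypothetical helper)."""
    return {name: len(split) for name, split in dataset_dict.items()}


def load_and_describe() -> Dict[str, int]:
    """Load CiteSum from the Hub and report split sizes.

    Needs network access and the `datasets` package, so it is not
    called automatically here.
    """
    from datasets import load_dataset

    ds = load_dataset("yuningm/citesum")
    return split_sizes(ds)
```

Calling `load_and_describe()` would download the dataset on first use; `split_sizes` on its own works with any mapping of split names to sized collections.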

To use our model pretrained on citation texts:

from transformers import pipeline
summarizer = pipeline("summarization", model="yuningm/bart-large-citesum")

article = ''' We describe a convolutional neural network that learns\
 feature representations for short textual posts using hashtags as a\
  supervised signal. The proposed approach is trained on up to 5.5 \
  billion words predicting 100,000 possible hashtags. As well as strong\
   performance on the hashtag prediction task itself, we show that its \
   learned representation of text (ignoring the hashtag labels) is useful\
    for other tasks as well. To that end, we present results on a document\
     recommendation task, where it also outperforms a number of baselines.
'''
summarizer(article)
# [{'summary_text': 'REF proposed a convolutional neural network 
# that learns feature representations for short textual posts 
# using hashtags as a supervised signal.'}]
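
Note that the `article` string above uses backslash line continuations inside triple quotes, so the leading spaces of each continued line become part of the text. If you prefer to pass a cleaner input, one option is to collapse whitespace first. This is a small sketch under that assumption; `normalize_whitespace` is a hypothetical helper, not part of this repo, and the output shown above was presumably produced without it, so results on normalized input may differ slightly.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of spaces/newlines into one space and trim the ends."""
    return " ".join(text.split())


# Usage with the pipeline above would look like:
#   summarizer(normalize_whitespace(article))
```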

To use our model further pretrained on paper titles:

from transformers import pipeline
summarizer = pipeline("summarization", model="yuningm/bart-large-citesum-title")

article = ''' We describe a convolutional neural network that learns\
 feature representations for short textual posts using hashtags as a\
  supervised signal. The proposed approach is trained on up to 5.5 \
  billion words predicting 100,000 possible hashtags. As well as strong\
   performance on the hashtag prediction task itself, we show that its \
   learned representation of text (ignoring the hashtag labels) is useful\
    for other tasks as well. To that end, we present results on a document\
     recommendation task, where it also outperforms a number of baselines.
'''
summarizer(article)
# [{'summary_text': 'Learning Text Representations from Hashtags using Convolutional Neural Networks'}]
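
As the commented outputs above show, the pipeline returns a list of dicts keyed by `summary_text`. If only the strings are needed, they can be pulled out with a one-line helper. A minimal sketch; `summary_texts` is a hypothetical name, not part of this repo.

```python
from typing import Dict, List


def summary_texts(outputs: List[Dict[str, str]]) -> List[str]:
    """Extract the generated strings from summarization pipeline output."""
    return [item["summary_text"] for item in outputs]


# Usage with the pipeline above would look like:
#   summary_texts(summarizer(article))
```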

How to run (DIY)

The dataset and the checkpoints pretrained on its citation sentences and paper titles are also available on Google Drive.

Check out example scripts under script/ to see how to train/evaluate on different datasets.
