This package provides a unified solution to "nuggify" encoder-decoder, decoder-only, and encoder-only (experimental) transformers.
The functions of this package are:
- Given the hidden states of the transformer, score tokens and sub-select nuggets
- Build a residual connection from the scores to the decoder (or the encoder, for encoder-only models)
Supported models:
- Bart/MBart
- BERT (experimental)
- LLaMA
- T5
Supporting new models should not be hard for standard huggingface/transformers models. Refer to the adaptor folder to adapt the package to new models.
In the papers, the residual connection directly adds the nugget scores to the attention logits, so the attention weights become softmax(q·k + s), where s is the score of the attended nugget. In this implementation, we by default use a straight-through estimator, adding s - stop_gradient(s) instead, which cancels the effect of the scores on the forward pass while still letting gradients flow back to the scorer.
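As a rough sketch (not the package's actual internals), the straight-through residual can be written as adding s - s.detach() to the cross-attention logits; the tensor shapes and function name below are assumptions for illustration:
import torch

def add_score_residual(attn_logits, nugget_scores, straight_through=True):
    # attn_logits: [batch, heads, tgt_len, num_nuggets] cross-attention logits (assumed shape)
    # nugget_scores: [batch, num_nuggets], one scalar per selected nugget (assumed shape)
    s = nugget_scores[:, None, None, :]
    if straight_through:
        # forward: s - s.detach() == 0, so the logits are unchanged;
        # backward: the gradient w.r.t. s is kept, so the scorer still gets trained
        return attn_logits + s - s.detach()
    # paper-style residual: add the scores to the logits directly
    return attn_logits + s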
The original implementation of Nugget builds the residual connection by inheriting from huggingface/transformers model classes in order to pass the scores argument, so it depends on specific huggingface/transformers versions. In this package, the implementation only modifies the forward method of the XFormerAttention classes.
The scores argument is not passed through the forward method. Instead, a special class, NuggetScoreFeeder, handles it through a context manager. See the usage below.
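The actual NuggetScoreFeeder lives in this package; conceptually it follows the pattern sketched below, where a context manager stashes the scores so that the patched attention forward can pick them up without changing its signature (the class and method names here are illustrative, not the package's API):
from contextlib import contextmanager

class ScoreFeederSketch:
    # illustrative stand-in: holds scores for the patched attention modules to read
    def __init__(self):
        self.scores = None

    @contextmanager
    def feed(self, scores):
        self.scores = scores
        try:
            yield self
        finally:
            # scores are only visible inside the `with` block
            self.scores = None

# A patched attention forward would check the feeder's `scores` and, if present,
# add the (straight-through) residual to its attention logits.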
Here is an example of using it with Bart:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from nugget import nuggify
model_name = 'facebook/bart-base'
ratio = 0.1
model, tok = AutoModelForSeq2SeqLM.from_pretrained(model_name), AutoTokenizer.from_pretrained(model_name)
inputs = tok('hello world', return_tensors='pt')
scorer, encoder, decoder, _ = nuggify(model)
encoder_out = encoder(**inputs)
nuggets = scorer(**inputs, hidden_states=encoder_out.last_hidden_state, nugget_ratio=ratio)
with scorer.score_context(nuggets):
    decoder_out = decoder(
        encoder_outputs=[nuggets.encoding], labels=inputs['input_ids'], attention_mask=nuggets.mask,
        decoder_attention_mask=inputs['attention_mask'],
    )
decoder_out.loss.backward()
To use a trained scorer, load the checkpoint and rebuild the Nugget modules on top of the pretrained model:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from nugget import nuggify
# load from ckpt and construct the Nugget modules
ckpt = torch.load('/path/to/scorer.ckpt')
model_name, nugget_kwargs = ckpt['model_name'], ckpt['kwargs']
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
scorer, encoder, decoder, nugget_kwargs = nuggify(base_model, **nugget_kwargs)
scorer.load_scorer(ckpt['scorer_states'])
# Optionally freeze the scorer parameters
scorer.requires_grad_(False)
# example usage
tok = AutoTokenizer.from_pretrained(model_name)
text = 'Natural language processing (NLP) is an interdisciplinary subfield of computer science and information retrieval. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. To this end, natural language processing often borrows ideas from theoretical linguistics. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.'
inputs = tok(text, return_tensors='pt')
# sub-selection happens here; the output is of the Nuggets type
# check the comments of [Nuggets](nugget/utils/types.py) for more information about Nugget
# note this will only select tokens but will not generate encodings
nuggets = scorer(**inputs)
# `.index` is the indices of the selected tokens.
selected = tok.convert_ids_to_tokens(inputs['input_ids'].gather(1, nuggets.index)[0].tolist())
print(selected)
# alternatively, if you already have T5 encodings, you can pass them to the scorer.
encodings = encoder(**inputs)
nuggets = scorer(**inputs, hidden_states=encodings.last_hidden_state)
# encoding is the last-layer embeddings of the selected tokens.
print(nuggets.encoding.shape)
# mask is necessary for a batch of sequences.
print(nuggets.mask.shape)
A nugget scorer scores each token and generates a scalar value indicating its priority for selection as a nugget. It consists of the first k layers of a transformer encoder stacked with an FFN layer.
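Based on that description, the scorer can be sketched as a few transformer layers followed by a feed-forward head that emits one score per token, with nuggets taken as the top-scoring tokens. The sketch below uses freshly initialized generic layers purely for illustration, whereas the description above refers to the first k layers of a transformer encoder:
import torch
import torch.nn as nn

class ScorerSketch(nn.Module):
    # illustration only: not the package's module; the real scorer stacks an FFN on top of
    # the first k layers of a transformer encoder
    def __init__(self, hidden_size=768, num_layers=3, num_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(hidden_size, 1)  # one scalar score per token

    def forward(self, hidden_states, ratio=0.1):
        scores = self.head(self.encoder(hidden_states)).squeeze(-1)  # [batch, seq_len]
        k = max(1, int(ratio * scores.size(1)))
        index = scores.topk(k, dim=1).indices  # indices of the selected nuggets
        return scores, index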
To load a scorer alone, you need the pretrained transformer checkpoint and the scorer checkpoint. Say we have meta-llama/Llama-2-7b-chat-hf and /path/to/scorer.pkl; then we can do:
from nugget import nuggify
from transformers import AutoModelForCausalLM, AutoTokenizer
pretrained = 'meta-llama/Llama-2-7b-chat-hf'
model = AutoModelForCausalLM.from_pretrained(pretrained)
tok = AutoTokenizer.from_pretrained(pretrained)
# ratio is nuggets/tokens
scorer, _, _, _ = nuggify(model, ratio=0.1)
scorer.load_scorer('/path/to/scorer.pkl')
text = 'Natural language processing (NLP) is an interdisciplinary subfield of computer science and information retrieval. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. To this end, natural language processing often borrows ideas from theoretical linguistics. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. '
inputs = tok(text, return_tensors='pt')
nuggets = scorer(**inputs)
# the scorer selects int(ratio*seq_len) tokens as nuggets; their indices are:
print(nuggets.index)
# if you want to use a different/flexible ratio, you can get the raw scores
print(nuggets.all_scores)
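For example, assuming nuggets.all_scores holds one score per input token with shape [batch, seq_len] (an assumption about the field's layout), a custom ratio could be applied on top of the raw scores like this:
custom_ratio = 0.2
seq_len = inputs['input_ids'].size(1)
k = max(1, int(custom_ratio * seq_len))
# pick the k highest-scoring tokens per sequence as nuggets
custom_index = nuggets.all_scores.topk(k, dim=-1).indices
print(tok.convert_ids_to_tokens(inputs['input_ids'].gather(1, custom_index)[0].tolist()))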
Please note that the current version of the Nugget codebase is tied to huggingface/transformers v4.41.x.
Please cite this paper if Nugget is helpful to your research:
@InProceedings{pmlr-v202-qin23a,
title = {Nugget: Neural Agglomerative Embeddings of Text},
author = {Qin, Guanghui and Van Durme, Benjamin},
booktitle = {Proceedings of the 40th International Conference on Machine Learning},
pages = {28337--28350},
year = {2023},
editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
volume = {202},
series = {Proceedings of Machine Learning Research},
month = {23--29 Jul},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v202/qin23a/qin23a.pdf},
url = {https://proceedings.mlr.press/v202/qin23a.html},
abstract = {Embedding text sequences is a widespread requirement in modern language understanding. Existing approaches focus largely on constant-size representations. This is problematic, as the amount of information contained in text often varies with the length of the input. We propose a solution called Nugget, which encodes language into a representation based on a dynamically selected subset of input tokens. These nuggets are learned through tasks like autoencoding and machine translation, and intuitively segment language into meaningful units. We demonstrate Nugget outperforms related approaches in tasks involving semantic comparison. Finally, we illustrate these compact units allow for expanding the contextual window of a language model (LM), suggesting new future LMs that can condition on significantly larger amounts of content.}
}
Other related papers that use the idea of Nugget are Dodo and STAR.