Proposal: Offset based Token Classification utilities #7019
Comments
Hi, this is a very nice issue and I plan to work soon (in the coming 2 weeks) on related things (improving the examples to make full use of the Rust tokenization features). I'll re-read this issue (and all the links) to extract all the details and likely come back to you at that time. In the meantime, here are two elements for your project:
Thanks! That's super helpful. I did find a bug and opened an issue:

```python
from transformers import BertTokenizerFast, GPT2TokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
for i in range(1, 5):
    txt = "💩" * i
    enc = tokenizer(txt, return_offsets_mapping=True)
    token_at_i = enc.char_to_token(i - 1)
    dec = tokenizer.decode(enc["input_ids"])
    print(f"I wrote {txt} but got back '{dec}' and char_to_token({i - 1}) returned {token_at_i}")
```
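For anyone reproducing this, a quick way to see what the fast tokenizer actually produced for this kind of input is to inspect the tokens and offsets directly. A minimal sketch, reusing the `tokenizer` from the snippet above:

```python
# Sketch: look at what the fast tokenizer reports for emoji-only text.
# tokens() and "offset_mapping" are features of the fast (Rust-backed) tokenizers.
enc = tokenizer("💩💩", return_offsets_mapping=True)
print(enc.tokens())           # the wordpieces, including special tokens like [CLS]/[SEP]
print(enc["offset_mapping"])  # the (start, end) character span each token covers
```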
Hmm, I think we should have an option to pad the labels in the tokenizer's padding step.
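For concreteness, here is a minimal sketch of what padding labels by hand to the tokenizer's padded length could look like. The helper name is just for illustration; `-100` is the conventional ignore index that the token-classification loss skips:

```python
from typing import List

def pad_labels(labels: List[List[int]], padded_length: int, ignore_index: int = -100) -> List[List[int]]:
    """Pad each row of token-aligned label ids out to the padded sequence length."""
    return [row + [ignore_index] * (padded_length - len(row)) for row in labels]

# e.g. pad_labels([[0, 1], [2]], padded_length=4) -> [[0, 1, -100, -100], [2, -100, -100, -100]]
```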
I think that presupposes that the user has labels already aligned to tokens, or that there is one and only one right way to align labels and tokens, which isn't consistent with the original issue. When that's not the case, we need to tokenize, then align the labels, and finally pad. (We also need to deal with overflow, but I haven't gotten that far yet.) Notably, the user may want to use a BIO, BILOU, or other labeling scheme and needs access to the tokens to modify the labels accordingly.

Something that confused me as I've been working on this is that the `_pad` function operates explicitly on named attributes of the batch encoding dict, whereas as a user I'd expect it to operate on everything in the underlying data dict. Because of the logic involved in alignment, I think padding of the tokens and labels might be better done outside of the tokenizer, probably with a specialized function/module. Also, I think that's a theoretical point, because it seems that the padding is done in Python anyway?

I ended up doing:

```python
def tokenize_with_labels(
    texts: List[str],
    raw_labels: List[List[SpanAnnotation]],
    tokenizer: PreTrainedTokenizerFast,
    label_set: LabelSet,  # basically the alignment strategy
):
    batch_encodings = tokenizer(
        texts,
        return_offsets_mapping=True,
        padding="longest",
        max_length=256,
        truncation=True,
    )
    batch_labels: IntListList = []
    for encoding, annotations in zip(batch_encodings.encodings, raw_labels):
        batch_labels.append(label_set.align_labels_to_tokens(encoding, annotations))
    return batch_encodings, batch_labels
```

where `align_labels_to_tokens` operates on already padded tokens. I found this the most convenient way to get dynamic batches with a collator:

```python
@dataclass
class LTCollator:
    tokenizer: PreTrainedTokenizerFast
    label_set: LabelSet
    padding: PaddingStrategy = True
    max_length: Optional[int] = None  # note: tokenize_with_labels currently hardcodes padding/max_length

    def __call__(self, texts_and_labels: List[Example]) -> BatchEncoding:
        texts: List[str] = []
        annotations: List[List[SpanAnnotation]] = []
        for text, annos in texts_and_labels:
            texts.append(text)
            annotations.append(annos)
        batch, labels = tokenize_with_labels(
            texts, annotations, self.tokenizer, label_set=self.label_set
        )
        del batch["offset_mapping"]
        batch.data["labels"] = labels  # put the labels in the BatchEncoding dict
        tensors = batch.convert_to_tensors(tensor_type="pt")  # convert to tensors
        return tensors
```
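`LabelSet.align_labels_to_tokens` isn't shown anywhere in this thread, so here is a minimal sketch of the kind of offset-based alignment it presumably performs. The plain IO labeling and the `SpanAnnotation` attributes (`start`, `end`, `label`, with character offsets) are assumptions for illustration, not the author's actual implementation; a BIO/BILOU scheme would need extra logic at span boundaries:

```python
from typing import Dict, List

def align_labels_to_tokens_sketch(
    encoding,                       # a tokenizers.Encoding, i.e. one item of BatchEncoding.encodings
    annotations: List["SpanAnnotation"],
    labels_to_id: Dict[str, int],   # e.g. {"O": 0, "drug": 1}
    ignore_index: int = -100,
) -> List[int]:
    """Give every token a label id based on whether its character span overlaps an annotation."""
    labels: List[int] = []
    for (start, end), special in zip(encoding.offsets, encoding.special_tokens_mask):
        if special:                 # [CLS], [SEP], [PAD] ... should be ignored by the loss
            labels.append(ignore_index)
            continue
        label = "O"
        for ann in annotations:     # assumed fields: ann.start, ann.end, ann.label
            if start < ann.end and end > ann.start:   # the character spans overlap
                label = ann.label
                break
        labels.append(labels_to_id[label])
    return labels
```

Because the batch is padded with `padding="longest"` before alignment, padding tokens are caught by the special-tokens branch, which matches the "operates on already padded tokens" remark above.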
As an example of the end-to-end flow (and please, no one use this, it's a probably-buggy work in progress):

```python
from typing import Any, Optional, List, Tuple
from transformers import (
    BertTokenizerFast,
    BertModel,
    BertForMaskedLM,
    BertForTokenClassification,
    TrainingArguments,
)
import torch
from transformers import AdamW, Trainer
from dataclasses import dataclass
from torch.utils.data import Dataset
import json
from torch.utils.data.dataloader import DataLoader
from transformers import PreTrainedTokenizerFast, DataCollatorWithPadding, BatchEncoding
from transformers.tokenization_utils_base import PaddingStrategy

from labelset import LabelSet
from token_types import IntListList, SpanAnnotation
from tokenize_with_labels import tokenize_with_labels

Example = Tuple[str, List[List[SpanAnnotation]]]


@dataclass
class LTCollator:
    tokenizer: PreTrainedTokenizerFast
    label_set: LabelSet
    padding: PaddingStrategy = True
    max_length: Optional[int] = None

    def __call__(self, texts_and_labels: List[Example]) -> BatchEncoding:
        texts: List[str] = []
        annotations: List[List[SpanAnnotation]] = []
        for text, annos in texts_and_labels:
            texts.append(text)
            annotations.append(annos)
        batch, labels = tokenize_with_labels(
            texts, annotations, self.tokenizer, label_set=self.label_set
        )
        del batch["offset_mapping"]
        batch.data["labels"] = labels  # put the labels in the BatchEncoding dict
        tensors = batch.convert_to_tensors(tensor_type="pt")  # convert to tensors
        return tensors


class LTDataset(Dataset):
    def __init__(
        self, data: Any, tokenizer: PreTrainedTokenizerFast,
    ):
        self.tokenizer = tokenizer
        # the data uses the key "tag" where the label set expects "label"
        for example in data["examples"]:
            for a in example["annotations"]:
                a["label"] = a["tag"]
        self.texts = []
        self.annotations = []
        for example in data["examples"]:
            self.texts.append(example["content"])
            self.annotations.append(example["annotations"])

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx) -> Example:
        return self.texts[idx], self.annotations[idx]


@dataclass
class LTDataControls:
    dataset: LTDataset
    collator: LTCollator
    label_set: LabelSet


def lt_data_factory(
    json_path: str, tokenizer: PreTrainedTokenizerFast, max_length=None
):
    data = json.load(open(json_path))
    dataset = LTDataset(data=data, tokenizer=tokenizer)
    tags = list(map(lambda x: x["name"], data["schema"]["tags"]))
    label_set = LabelSet(tags)
    collator = LTCollator(
        max_length=max_length, label_set=label_set, tokenizer=tokenizer
    )
    return LTDataControls(dataset=dataset, label_set=label_set, collator=collator)


if __name__ == "__main__":
    from transformers import BertTokenizerFast, GPT2TokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    data_controls = lt_data_factory(
        "/home/tal/Downloads/small_gold_no_paragr_location_types_false_5_annotations.json",
        tokenizer=tokenizer,
        max_length=256,
    )
    dl = DataLoader(
        data_controls.dataset, collate_fn=data_controls.collator, batch_size=10
    )
    model = BertForTokenClassification.from_pretrained(
        "bert-base-cased", num_labels=len(data_controls.label_set.ids_to_label.values())
    )
    trainer = Trainer(
        model=model,
        data_collator=data_controls.collator,
        train_dataset=data_controls.dataset,
        args=TrainingArguments("/tmp/trainer", per_device_train_batch_size=2),
    )
    trainer.train()
```
Also, I found this comment by @sgugger about the Trainer.

I think that sentiment might make sense here: what I'm looking for is outside the scope of the library. If that's the case, I would have preferred it be written in big bold letters, rather than the library trying to cater to this use case.
So, it even comes with a repo. Given data that looks like this:

```python
[{'annotations': [],
  'content': 'No formal drug interaction studies of Aranesp? have been '
             'performed.',
  'metadata': {'original_id': 'DrugDDI.d390.s0'}},
 {'annotations': [{'end': 13, 'label': 'drug', 'start': 6, 'tag': 'drug'},
                  {'end': 60, 'label': 'drug', 'start': 43, 'tag': 'drug'},
                  {'end': 112, 'label': 'drug', 'start': 105, 'tag': 'drug'},
                  {'end': 177, 'label': 'drug', 'start': 164, 'tag': 'drug'},
                  {'end': 194, 'label': 'drug', 'start': 181, 'tag': 'drug'},
                  {'end': 219, 'label': 'drug', 'start': 211, 'tag': 'drug'},
                  {'end': 238, 'label': 'drug', 'start': 227, 'tag': 'drug'}],
  'content': 'Since PLETAL is extensively metabolized by cytochrome P-450 '
             'isoenzymes, caution should be exercised when PLETAL is '
             'coadministered with inhibitors of C.P.A. such as ketoconazole '
             'and erythromycin or inhibitors of CYP2C19 such as omeprazole.',
  'metadata': {'original_id': 'DrugDDI.d452.s0'}},
 {'annotations': [{'end': 58, 'label': 'drug', 'start': 47, 'tag': 'drug'},
                  {'end': 75, 'label': 'drug', 'start': 62, 'tag': 'drug'},
                  {'end': 135, 'label': 'drug', 'start': 124, 'tag': 'drug'},
                  {'end': 164, 'label': 'drug', 'start': 152, 'tag': 'drug'}],
  'content': 'Pharmacokinetic studies have demonstrated that omeprazole and '
             'erythromycin significantly increased the systemic exposure of '
             'cilostazol and/or its major metabolites.',
  'metadata': {'original_id': 'DrugDDI.d452.s1'}}]
```

we can do this:

```python
from sequence_aligner.labelset import LabelSet
from sequence_aligner.dataset import TrainingDataset
from sequence_aligner.containers import TraingingBatch
import json

raw = json.load(open('./data/ddi_train.json'))
for example in raw:
    for annotation in example['annotations']:
        # we expect the label key to be "label", but the data has "tag"
        annotation['label'] = annotation['tag']

# (the construction of `label_set` and `dataset` via LabelSet/TrainingDataset
#  was omitted from the original comment)

from torch.utils.data import DataLoader
from transformers import BertForTokenClassification, AdamW

model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(dataset.label_set.ids_to_label.values())
)
optimizer = AdamW(model.parameters(), lr=5e-6)
dataloader = DataLoader(
    dataset,
    collate_fn=TraingingBatch,
    batch_size=4,
    shuffle=True,
)
for num, batch in enumerate(dataloader):
    optimizer.zero_grad()  # clear gradients from the previous step
    loss, logits = model(
        input_ids=batch.input_ids,
        attention_mask=batch.attention_masks,
        labels=batch.labels,
    )
    loss.backward()
    optimizer.step()
```
-------------------------------
I think most of this is out of scope for the transformers library itself, so I'm all for closing this issue if no one objects.
(I attempted to fix the links above, let me know if this is correct @talolard)
Links seem kosher, thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

-------------------------------
🚀 Feature request
Hi. So we work a lot with span annotations on text that isn't tokenized, and we want a "canonical" way to work with that. I have some ideas and rough implementations, so I'm looking for feedback on whether this belongs in the library and whether the proposed implementation is more or less sound.
I also think there is a good chance that everything I want already exists and that the only solution needed is slightly clearer documentation. I hope that's the case, and I'd be happy to write the documentation if someone can point me in the right direction.
The Desired Capabilities
What I'd like is a canonical way to:
Some Nice To Haves
Current State and what I'm missing
Alignment
The way to align tokens to span annotations is to use the return_offsets_mapping flag on the (fast) tokenizer (which is awesome!).
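For concreteness, here is a small sketch of what that flag returns; this is just an illustration of the output, not the alignment strategy referenced below:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
text = "Aspirin interacts with warfarin."
enc = tokenizer(text, return_offsets_mapping=True)
for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    # special tokens such as [CLS] and [SEP] report the empty span (0, 0)
    print(token, (start, end), repr(text[start:end]))
```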
There are probably a few strategies, I've been using this
I use logic like this:
And then call that function inside add_labels here
This works, and it's nice because the padding is consistent with the longest sentence, so bucketing gives a big boost. But the add_labels logic is in Python and thus sequential over the examples, so it's not super fast. I haven't measured this to confirm it's a problem; I'm just bringing it up.
Desired Solution
I need most of this stuff so I'm going to make it. I could do it
The current "NER" examples and issues assume that text is pre-tokenized. Our use case is such that the full text is not tokenized and the labels for "NER" come as character offsets. I propose a utility/example to handle that scenario, because I haven't been able to find one.
In practice, most values of X don't need any modification, and doing what I propose (below) in Rust is beyond me, so this might boil down to a utility class and documentation.
Motivation
I make text annotation tools, and our output is span annotations on untokenized text. I want our users to be able to easily use transformers. I suspect from my (limited) experience that in many non-academic use cases span annotations on untokenized text are the norm, and that others would benefit from this as well.
Possible ways to address this
I can imagine a few scenarios here
Your contribution
I'd be happy to implement and submit a PR, or make an external library or add to a relevant existing one.
Related issues