
Heavily improve automatic model card generation + Patch XLM-R #28

Merged: 95 commits, Sep 29, 2023
Changes from 62 commits
b413172
Uncomment pushing to the Hub
tomaarsen Aug 16, 2023
bd036b6
Initial version to improve automatic model card generation
tomaarsen Aug 16, 2023
782c1d2
Simplify label normalization
tomaarsen Aug 16, 2023
648c505
Automatically select some eval sentences for the widget
tomaarsen Aug 16, 2023
12a04bc
Improve language card
tomaarsen Aug 16, 2023
23ca683
Add automatic evaluation results
tomaarsen Aug 16, 2023
c762433
Use dash instead of underscore in model name
tomaarsen Aug 24, 2023
4c7d402
Add extra TODOs
tomaarsen Aug 24, 2023
cbdcdac
model.predict text as the first example
tomaarsen Aug 26, 2023
4c33537
Automatically set model name based on encoder & dataset
tomaarsen Aug 26, 2023
b9c5d4e
Remove accidental Dataset import
tomaarsen Aug 27, 2023
d7dc4ac
Rename examples to widget examples
tomaarsen Aug 27, 2023
a6d5e1e
Add table with label examples
tomaarsen Aug 27, 2023
ccd136b
Ensure complete metadata
tomaarsen Aug 27, 2023
28c67a7
Add tokenizer warning if punct must be split from words
tomaarsen Aug 27, 2023
c016e4d
Remove dead code
tomaarsen Aug 27, 2023
7e2e800
Rename poor variable names
tomaarsen Aug 27, 2023
c1d6967
Fix incorrect warning
tomaarsen Aug 27, 2023
f9fe787
Add " in the model labels
tomaarsen Aug 27, 2023
118e695
Set model_id based on args if possible
tomaarsen Aug 27, 2023
e2cb59a
Add training set metrics
tomaarsen Aug 28, 2023
ab4476f
Randomly select 100 samples for the widget examples
tomaarsen Aug 28, 2023
7f44cb9
Prevent duplicate widget examples
tomaarsen Aug 28, 2023
e693018
Remove completed TODO
tomaarsen Aug 28, 2023
41589c8
Use title case throughout model card
tomaarsen Aug 28, 2023
c2a06a0
Add useful comments if values not provided
tomaarsen Aug 28, 2023
6dd0a84
Add environmental impact with codecarbon
tomaarsen Aug 28, 2023
6188d8e
Merge branch 'main' of https://github.com/tomaarsen/SpanMarkerNER int…
tomaarsen Aug 29, 2023
19f7c3c
Ensure that the model card template is included in the install
tomaarsen Aug 29, 2023
caf09ba
Add training hardware section
tomaarsen Sep 5, 2023
56861e9
Add Python version
tomaarsen Sep 5, 2023
fdf9ecf
Make everything title case
tomaarsen Sep 5, 2023
be11866
Add missing docstring
tomaarsen Sep 5, 2023
d7848ac
Add docstring for SpanMarkerModelCardData
tomaarsen Sep 5, 2023
aecb6f4
Update CHANGELOG
tomaarsen Sep 5, 2023
6d15a37
Add SpanMarkerModelCardData to dunder init
tomaarsen Sep 5, 2023
a5d6b50
Add SpanMarkerModelCardData to snippets
tomaarsen Sep 5, 2023
aaf0545
Resolve breaking error if hub_model_id is set
tomaarsen Sep 5, 2023
e7e0a43
gpu_model -> hardware_used
tomaarsen Sep 6, 2023
8fd4c1b
Add "base_model" to metadata
tomaarsen Sep 6, 2023
151b3cf
Increment datasets min version to 2.14.0
tomaarsen Sep 6, 2023
5e6bf4d
Update trainer evaluate tests
tomaarsen Sep 6, 2023
aa5153e
Skip old model card test for now
tomaarsen Sep 6, 2023
280b601
Fix edge case: less than 5 examples
tomaarsen Sep 6, 2023
b71f96d
pytest.skip -> pytest.mark.skip
tomaarsen Sep 12, 2023
92b9de1
Merge branch 'main' of https://github.com/tomaarsen/SpanMarkerNER int…
tomaarsen Sep 12, 2023
b64119c
Try to infer the language from the dataset
tomaarsen Sep 12, 2023
179971f
Add citations and hidden sections
tomaarsen Sep 12, 2023
5db7f8a
Refactor inferring language
tomaarsen Sep 12, 2023
ff5db00
Merge branch 'main' of https://github.com/tomaarsen/SpanMarkerNER int…
tomaarsen Sep 13, 2023
0b3ec56
Remove unused import
tomaarsen Sep 13, 2023
e6d517a
Add comment explaining version
tomaarsen Sep 13, 2023
73b13ef
Override default Trainer create_model_card
tomaarsen Sep 13, 2023
9de5d24
Update model card template slightly
tomaarsen Sep 13, 2023
6daf082
Add newline to model card template
tomaarsen Sep 13, 2023
edf6015
Remove incorrect space
tomaarsen Sep 13, 2023
d301225
Add model card tests
tomaarsen Sep 13, 2023
f708afe
Improve Trainer tests regarding model card
tomaarsen Sep 13, 2023
f0d11fa
Remove commented out breakpoint
tomaarsen Sep 13, 2023
b271eb6
Add codecarbon to CI
tomaarsen Sep 13, 2023
c91f17f
Rename integration extra to codecarbon
tomaarsen Sep 13, 2023
0b56d28
Make hardware_used optional (if no GPU present)
tomaarsen Sep 13, 2023
3e46869
Apply suggestions to model_card_template
tomaarsen Sep 14, 2023
ef0ea18
Update model card test pattern alongside template changes
tomaarsen Sep 14, 2023
f6e730a
Don't include hardware_used when no GPU present
tomaarsen Sep 14, 2023
32617c6
Set "No GPU used" for GPU Model if hardware_used is None
tomaarsen Sep 14, 2023
7dc4acd
Don't store None in yaml
tomaarsen Sep 14, 2023
a6c5689
Ensure that emissions is a regular float
tomaarsen Sep 14, 2023
6ed39b5
kgs to g
tomaarsen Sep 14, 2023
53c7321
support e-05 notation
tomaarsen Sep 14, 2023
1a1480d
Add small test case for model cards
tomaarsen Sep 14, 2023
9ddcdd9
Update model tables in docs
tomaarsen Sep 14, 2023
96ec42b
Link to the spaCy integration in the tokenizer warning
tomaarsen Sep 14, 2023
3126200
Update README snippet
tomaarsen Sep 15, 2023
5619fcc
Update outdated docs: entity_max_length default is 8
tomaarsen Sep 15, 2023
40154e4
Remove /models from URL, caused 404s
tomaarsen Sep 26, 2023
bd838b0
Fix outdated type hint
tomaarsen Sep 26, 2023
f2edd06
🎉 Apply XLM-R patch
tomaarsen Sep 26, 2023
084e2d0
Remove /models from test
tomaarsen Sep 26, 2023
c5f72a5
Remove tokenizer warning after patch
tomaarsen Sep 26, 2023
eea3880
Update training docs with model card data etc.
tomaarsen Sep 26, 2023
4a70e18
Pad token embeddings to multiple of 8
tomaarsen Sep 26, 2023
ec90a80
Always attach list directly to header
tomaarsen Sep 27, 2023
457d75e
Tackle edge case where dataset card has no metadata
tomaarsen Sep 27, 2023
adb0de6
Allow installing nltk for detokenizing model card examples
tomaarsen Sep 27, 2023
35b43c4
Add model card docs
tomaarsen Sep 27, 2023
93f1689
Mention codecarbon install in docstring
tomaarsen Sep 27, 2023
4e02a16
overwrite the default codecarbon log level to "error"
tomaarsen Sep 27, 2023
aebb4aa
Update CHANGELOG
tomaarsen Sep 28, 2023
08000f8
Fix issue with inference example containing full quotes
tomaarsen Sep 28, 2023
cfb7577
Update CHANGELOG
tomaarsen Sep 28, 2023
e5911bf
Never print a model when printing SpanMarkerModelCardData
tomaarsen Sep 28, 2023
7d1fa8b
Try to infer the dataset_id from the training set
tomaarsen Sep 28, 2023
9321361
Merge branch 'main' into feat/improved_model_cards
tomaarsen Sep 29, 2023
2f753e5
Update the main docs landing page
tomaarsen Sep 29, 2023
2 changes: 1 addition & 1 deletion .github/workflows/tests.yaml
@@ -38,7 +38,7 @@ jobs:
- name: Install external dependencies on cache miss
run: |
python -m pip install --no-cache-dir --upgrade pip
python -m pip install --no-cache-dir ".[dev]"
python -m pip install --no-cache-dir ".[dev, codecarbon]"
python -m spacy download en_core_web_sm
if: steps.restore-cache.outputs.cache-hit != 'true'

16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -15,6 +15,22 @@ Types of changes
* "Security" in case of vulnerabilities.
-->

## [Unreleased]

### Added

- Added `SpanMarkerModel.generate_model_card()` method to get a model card string.
- Added `SpanMarkerModelCardData` that should be passed to `SpanMarkerModel.from_pretrained` with additional information like
- `language`, `license`, `model_name`, `model_id`, `encoder_name`, `encoder_id`, `dataset_name`, `dataset_id`, `dataset_revision`.

### Changed

- Heavily improved automatic model card generation.
- Evaluating outside of training now returns per-label outputs instead of only the "overall" F1, precision and recall.
- Warn if the tokenizer in use distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space.
  - If so, inference with that model will require the punctuation to be split from the words.
- Improved label normalization speed.

## [1.3.0]

### Added
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include span_marker/model_card_template.md
21 changes: 15 additions & 6 deletions README.md
@@ -46,25 +46,36 @@ Please have a look at our [Getting Started](notebooks/getting_started.ipynb) not
```python
from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer
from span_marker import SpanMarkerModel, Trainer, SpanMarkerModelCardData


def main() -> None:
# Load the dataset, ensure "tokens" and "ner_tags" columns, and get a list of labels
dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
dataset_id = "DFKI-SLT/few-nerd"
dataset_name = "FewNERD"
dataset = load_dataset(dataset_id, "supervised")
dataset = dataset.remove_columns("ner_tags")
dataset = dataset.rename_column("fine_ner_tags", "ner_tags")
labels = dataset["train"].features["ner_tags"].feature.names

# Initialize a SpanMarker model using a pretrained BERT-style encoder
model_name = "bert-base-cased"
encoder_id = "bert-base-cased"
model = SpanMarkerModel.from_pretrained(
model_name,
encoder_id,
labels=labels,
# SpanMarker hyperparameters:
model_max_length=256,
marker_max_length=128,
entity_max_length=8,
# Model card arguments
model_card_data=SpanMarkerModelCardData(
model_id="tomaarsen/span-marker-bert-base-fewnerd-fine-super",
encoder_id=encoder_id,
dataset_name=dataset_name,
dataset_id=dataset_id,
license="cc-by-sa-4.0",
language="en",
),
)

# Prepare the 🤗 transformers training arguments
@@ -121,8 +132,6 @@ entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B
{'span': 'Paris', 'label': 'location-GPE', 'score': 0.9892390966415405, 'char_start_index': 78, 'char_end_index': 83}]
```

<!-- Because this work is based on [PL-Marker](https://arxiv.org/pdf/2109.06067v5.pdf), you may expect similar results to its [Papers with Code Leaderboard](https://paperswithcode.com/paper/pack-together-entity-and-relation-extraction) results. -->

## Pretrained Models

All models in this list contain `train.py` files that show the training scripts used to generate them. Additionally, all training scripts used are stored in the [training_scripts](training_scripts) directory.
74 changes: 37 additions & 37 deletions notebooks/getting_started.ipynb
@@ -76,20 +76,20 @@
"name": "stdout",
"output_type": "stream",
"text": [
"DatasetDict({\n",
" train: Dataset({\n",
" features: ['id', 'tokens', 'ner_tags', 'fine_ner_tags'],\n",
" num_rows: 131767\n",
" })\n",
" validation: Dataset({\n",
" features: ['id', 'tokens', 'ner_tags', 'fine_ner_tags'],\n",
" num_rows: 18824\n",
" })\n",
" test: Dataset({\n",
" features: ['id', 'tokens', 'ner_tags', 'fine_ner_tags'],\n",
" num_rows: 37648\n",
" })\n",
"})"
]
}
],
@@ -317,9 +317,9 @@
"- 2 missed entities with 15 words (0.009828%)\n",
"- 1 missed entities with 17 words (0.004914%)\n",
"- 1 missed entities with 19 words (0.004914%)\n",
"Tracking run with wandb version 0.14.0\n",
"Run data is saved locally in ...\n",
"Syncing run colorful-leaf-761 to Weights & Biases\n"
]
},
{
@@ -462,7 +462,7 @@
"text": [
"{'eval_loss': 0.019159900024533272, 'eval_overall_precision': 0.7773279352226721, 'eval_overall_recall': 0.7774778249132279, 'eval_overall_f1': 0.7774028728429576, 'eval_overall_accuracy': 0.9399702095533473, 'eval_runtime': 28.0225, 'eval_samples_per_second': 87.394, 'eval_steps_per_second': 21.875, 'epoch': 0.98}\n",
"{'train_runtime': 453.1296, 'train_samples_per_second': 21.667, 'train_steps_per_second': 2.708, 'train_loss': 0.06319850289734186, 'epoch': 1.0}\n",
"TrainOutput(global_step=1227, training_loss=0.06319850289734186, metrics={'train_runtime': 453.1296, 'train_samples_per_second': 21.667, 'train_steps_per_second': 2.708, 'train_loss': 0.06319850289734186, 'epoch': 1.0})"
]
}
],
@@ -489,15 +489,15 @@
"text": [
"Loading cached processed dataset at ...\n",
"Loading cached processed dataset at ...\n",
"{'eval_loss': 0.019206691533327103,\n",
" 'eval_overall_precision': 0.7758985200845666,\n",
" 'eval_overall_recall': 0.7784419591207096,\n",
" 'eval_overall_f1': 0.7771681586293194,\n",
" 'eval_overall_accuracy': 0.9398477830602543,\n",
" 'eval_runtime': 28.0849,\n",
" 'eval_samples_per_second': 87.2,\n",
" 'eval_steps_per_second': 21.827,\n",
" 'epoch': 1.0}"
]
}
],
@@ -533,15 +533,15 @@
"- 1 missed entities with 17 words (0.019040%)\n",
"- 1 missed entities with 19 words (0.019040%)\n",
"- 1 missed entities with 40 words (0.019040%)\n",
"{'test_loss': 0.019189156591892242,\n",
" 'test_overall_precision': 0.769879287219774,\n",
" 'test_overall_recall': 0.7679663608562691,\n",
" 'test_overall_f1': 0.7689216342933691,\n",
" 'test_overall_accuracy': 0.938544749464231,\n",
" 'test_runtime': 28.0932,\n",
" 'test_samples_per_second': 86.854,\n",
" 'test_steps_per_second': 21.713,\n",
" 'epoch': 1.0}"
]
}
],
@@ -660,7 +660,7 @@
"metadata": {},
"outputs": [],
"source": [
"# trainer.push_to_hub()"
"trainer.push_to_hub()"
]
},
{
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -27,7 +27,7 @@ dependencies = [
"torch",
"accelerate",
"transformers>=4.19.0", # required for EvalPrediction.inputs
"datasets>=2.0.0",
"datasets>=2.14.0", # required for sorting with multiple columns
"packaging>=20.0",
"evaluate",
"seqeval",
@@ -59,6 +59,9 @@ docs = [
wandb = [
"wandb"
]
codecarbon = [
"codecarbon"
]

[project.urls]
Documentation = "https://tomaarsen.github.io/SpanMarkerNER"
1 change: 1 addition & 0 deletions span_marker/__init__.py
@@ -7,6 +7,7 @@
from transformers import AutoConfig, AutoModel, TrainingArguments

from span_marker.configuration import SpanMarkerConfig
from span_marker.model_card import SpanMarkerModelCardData
from span_marker.modeling import SpanMarkerModel
from span_marker.trainer import Trainer

12 changes: 7 additions & 5 deletions span_marker/evaluation.py
@@ -9,7 +9,9 @@
from span_marker.tokenizer import SpanMarkerTokenizer


def compute_f1_via_seqeval(tokenizer: SpanMarkerTokenizer, eval_prediction: EvalPrediction) -> Dict[str, float]:
def compute_f1_via_seqeval(
tokenizer: SpanMarkerTokenizer, eval_prediction: EvalPrediction, is_in_train: bool
) -> Dict[str, float]:
"""Compute micro-F1, recall, precision and accuracy scores using ``seqeval`` for the evaluation predictions.

Note:
@@ -98,7 +100,7 @@ def compute_f1_via_seqeval(tokenizer: SpanMarkerTokenizer, eval_prediction: Eval
with warnings.catch_warnings():
warnings.simplefilter("ignore", UndefinedMetricWarning)
results = seqeval.compute()
# `results` also contains e.g. "person-athlete": {'precision': 0.5982658959537572, 'recall': 0.9, 'f1': 0.71875, 'number': 230}
# logging this all is overkill. Tensorboard doesn't even support it, WandB does, but it's not very useful generally.
# I'd like to revisit this to expose this information somehow still
return {key: value for key, value in results.items() if isinstance(value, float)}

if is_in_train:
return {key: value for key, value in results.items() if isinstance(value, float)}
return results
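The `is_in_train` change above can be sketched in isolation. This is an illustrative stand-alone version of the filtering logic shown in the diff, not the library's exact API; the metric values are made up:

```python
# Illustrative sketch of the is_in_train gating from the diff above.
# During training, only scalar "overall" metrics are kept (per-label dicts
# would overwhelm loggers like TensorBoard); outside of training, the full
# per-label breakdown is returned as well.
results = {
    "overall_precision": 0.776,
    "overall_recall": 0.777,
    "overall_f1": 0.777,
    # seqeval also emits nested per-label dicts, e.g.:
    "person-athlete": {"precision": 0.598, "recall": 0.9, "f1": 0.719, "number": 230},
}

def select_metrics(results: dict, is_in_train: bool) -> dict:
    if is_in_train:
        # Keep only the flat float metrics.
        return {key: value for key, value in results.items() if isinstance(value, float)}
    return results

train_metrics = select_metrics(results, is_in_train=True)  # scalars only
full_metrics = select_metrics(results, is_in_train=False)  # everything
```

This mirrors why `Trainer.evaluate` outside of training can now surface per-label precision/recall/F1, while training-time logging stays compact.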
17 changes: 10 additions & 7 deletions span_marker/label_normalizer.py
@@ -27,8 +27,17 @@ def __init__(self, config: SpanMarkerConfig) -> None:
self.config = config

@abstractmethod
def ner_tags_to_entities(self, ner_tags: List[int]) -> Iterator[Entity]:
pass

def __call__(self, tokens: List[str], ner_tags: List[int]) -> Dict[str, List[Any]]:
raise NotImplementedError
output = {"ner_tags": [], "entity_count": [], "word_count": []}
for tokens, ner_tags in zip(tokens, ner_tags):
ner_tags = list(self.ner_tags_to_entities(ner_tags))
output["ner_tags"].append(ner_tags)
output["entity_count"].append(len(ner_tags))
output["word_count"].append(len(tokens))
return output


class LabelNormalizerScheme(LabelNormalizer):
@@ -57,9 +66,6 @@ def ner_tags_to_entities(self, ner_tags: List[int]) -> Iterator[Entity]:
if start_idx is not None:
yield (reduced_label_id, start_idx, idx + 1)

def __call__(self, tokens: List[str], ner_tags: List[int]) -> Dict[str, List[Any]]:
return {"tokens": tokens, "ner_tags": list(self.ner_tags_to_entities(ner_tags))}


class LabelNormalizerIOB(LabelNormalizerScheme):
def __init__(self, config: SpanMarkerConfig) -> None:
@@ -108,9 +114,6 @@ def ner_tags_to_entities(self, ner_tags: List[int]) -> Iterator[Entity]:
if start_idx is not None:
yield (entity_label_id, start_idx, idx + 1)

def __call__(self, tokens: List[str], ner_tags: List[int]) -> Dict[str, List[Any]]:
return {"tokens": tokens, "ner_tags": list(self.ner_tags_to_entities(ner_tags))}


class AutoLabelNormalizer:
"""Factory class to return the correct LabelNormalizer subclass."""
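The label normalizer refactor above moves `__call__` into the abstract base class, so each subclass only implements `ner_tags_to_entities` and the batching plus `entity_count`/`word_count` bookkeeping lives in one place. A minimal stand-alone sketch of that pattern follows; the single-word-entity tagging scheme here is a hypothetical stand-in, only the batching structure mirrors the diff:

```python
from typing import Any, Dict, Iterator, List, Tuple

# (label_id, start_word_index, end_word_index) -- matches the Entity tuples
# yielded by ner_tags_to_entities in the diff above.
Entity = Tuple[int, int, int]

class LabelNormalizerSketch:
    def ner_tags_to_entities(self, ner_tags: List[int]) -> Iterator[Entity]:
        # Toy scheme (hypothetical): every nonzero tag is a one-word entity.
        for idx, tag in enumerate(ner_tags):
            if tag != 0:
                yield (tag, idx, idx + 1)

    def __call__(self, tokens: List[List[str]], ner_tags: List[List[int]]) -> Dict[str, List[Any]]:
        # Shared batching logic, as in the refactored base class:
        # per sentence, collect entities and record entity/word counts.
        output = {"ner_tags": [], "entity_count": [], "word_count": []}
        for toks, tags in zip(tokens, ner_tags):
            entities = list(self.ner_tags_to_entities(tags))
            output["ner_tags"].append(entities)
            output["entity_count"].append(len(entities))
            output["word_count"].append(len(toks))
        return output

normalizer = LabelNormalizerSketch()
out = normalizer([["Tom", "Aarsen"]], [[1, 1]])
```

The extra `entity_count` and `word_count` columns are what make dataset-level statistics (and the sorting that motivated the `datasets>=2.14.0` bump) cheap to compute later.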