
Use ModelOutput instead of tuples #283

Merged (31 commits into explosion:master on Sep 9, 2021)

Conversation

@adrianeboyd (Contributor) commented Jul 22, 2021

  • Save model output as ModelOutput instead of a list of tensors in TransformerData.model_output and FullTransformerBatch.model_output.

    • For backwards compatibility with transformers v3, set return_dict = True in the transformer config.
  • TransformerData.tensors and FullTransformerBatch.tensors return ModelOutput.to_tuple() (see the sketch after this list).

  • Store any additional model output as ModelOutput in TransformerData.model_output.

    • Save all torch.Tensor and tuple(torch.Tensor) values in TransformerData.model_output for cases where tensor.shape[0] is the batch size so that it's possible to slice the output for individual docs.
      • Includes: pooler_output, hidden_states, attentions, and cross_attentions
  • Re-enable tests for gpt2 and xlnet in the CI.

  • Following Fixing Transformer IO (attempt 2) #285, include some minor modifications and bug fixes for HFShim and HFObjects

    • Rename the temporary init-only configs in HFObjects and don't serialize them in HFShim once the model is initialized

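A rough sketch of what this looks like from the pipeline side after the change (the model name is only an example; which keys appear in model_output depends on the model and its config):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("transformer", config={"model": {"name": "distilbert-base-uncased"}})
nlp.initialize()

trf_data = nlp("An example sentence.")._.trf_data
# model_output is a transformers ModelOutput; .tensors is derived from it via to_tuple()
assert "last_hidden_state" in trf_data.model_output
tensors = trf_data.tensors
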
* Use `BaseModelOutput` instead of a list of tensors for the transformer
model output and in `FullTransformerBatch.tensors`.

  * For backwards compatibility with transformers v3 set
    `return_dict = True` in the transformer config.

* Rename `TransformerData.attention` to `TransformerData.attentions` for
consistency with `BaseModelOutput`.


For compatibility with transformers v3.x
@adrianeboyd marked this pull request as draft July 22, 2021 20:03
* Use `albert-base-v2` to reduce model downloads in tests
* Test tagger training with `output_attentions` enabled
* Check for `pooling_output` (unnamed) in `trf_data`
* Only store `last_hidden_state` in `TransformerData.tensors`

* Move all `torch.Tensor` and `tuple(torch.Tensor)` values into
  `TransformerData.model_output` for cases where `tensor.shape[-2]` is
  the sequence length so that it's possible to slice the output for
  individual docs (see the sketch after this list)
  * Includes: `pooler_output`, `hidden_states`, `attentions`, and
    `cross_attentions`
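
A minimal sketch of the kind of per-doc slicing this enables (hypothetical shapes and lengths, not the library's actual splitting code):

import torch

hidden = torch.zeros(2, 8, 4)   # hypothetical batch: 2 docs, 8 wordpieces, width 4
doc_lengths = [5, 8]            # hypothetical number of wordpieces per doc

# because the sequence length is at shape[-2], each doc's rows can be sliced out
per_doc = [hidden[i, :length] for i, length in enumerate(doc_lengths)]
assert [t.shape for t in per_doc] == [(5, 4), (8, 4)]
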
@adrianeboyd marked this pull request as ready for review July 26, 2021 14:04
@adrianeboyd (Contributor, PR author)

@KennethEnevoldsen If you have time, could I ask you to take a look at this PR? I start to go cross-eyed after a bit with all the model formats/outputs and the doc splitting, and since the test suite is a bit sparse, it's possible I've missed something that would cause problems for workflows like yours.

The main API changes (also described above) are that FullTransformerBatch.tensors is a ModelOutput instead of a tuple, and that TransformerData.model_output contains any additional model output (as much as can be split by batch size), with TransformerData.tensors only containing last_hidden_state. That way there's just model_output, with no special cases in the code for attentions specifically. It's possible that it might be better or more generic to use the extra annotation setter to set these values, but I think it would require duplicating a lot of the doc splitting code.

FullTransformerBatch.tensors could possibly be aliased or renamed (and maybe TransformerData.tensors, too), perhaps with last_hidden_state stored within model_output rather than separately. I'm also not sure whether it's confusing that ModelOutput is used in TransformerData to store non-torch tensors.

@KennethEnevoldsen (Contributor)

I will take a look at it. It will probably be the start of next week, though; I hope that is fine.

@KennethEnevoldsen (Contributor) left a comment

I think most of this will work just fine, maybe with the exception of the zero tensor in an empty FullTransformerBatch.

Moving attention into model_output seems reasonable and should also be easier to maintain, though it might be slightly too nested. Here I think the extra annotation setters you mentioned are the better solution in terms of usability.

As you mentioned, it might cause confusion that FullTransformerBatch.tensors and TransformerData.tensors aren't the same data type (and aren't a tensor in one case). I think the ideal situation here would be to expose last_hidden_state and model_output (maybe even other features of interest, e.g. attention) and keep tensors for backwards compatibility.

@adrianeboyd (Contributor, PR author)

Thanks for the feedback! Thinking about it again, maybe both data classes should use ModelOutput internally as .model_output, with .tensors becoming a property that returns .model_output.to_tuple(). It could cause problems for people who are trying to write to tensors (which is rare, hopefully?), and we'd have to be sure that you can still extract last_hidden_state consistently, but that's something we should be able to manage internally.
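
Roughly the shape of that idea (a sketch, not the actual TransformerData definition; the ModelOutput import path varies across transformers versions):

from dataclasses import dataclass, field
from typing import Tuple

from transformers.file_utils import ModelOutput

@dataclass
class TransformerDataSketch:
    # the ModelOutput is the single source of truth for the model output
    model_output: ModelOutput = field(default_factory=ModelOutput)

    @property
    def tensors(self) -> Tuple:
        # derived on access, so it can never diverge from model_output
        return self.model_output.to_tuple()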

@KennethEnevoldsen (Contributor)

I think that is a reasonable approach, and I think assignment to .tensors is rare. Alternatively, you could set it once and allow it to be overwritten, though I suspect this may lead to .model_output and .tensors holding different tensors.

In `TransformerData` and `FullTransformerBatch` use `ModelOutput` rather
than a list of tensors as the primary model output data storage. In both
classes, `.tensors` is converted to a property that returns
`model_output.to_tuple()`.
@adrianeboyd (Contributor, PR author)

Yeah, I think it would be really bad if they diverge, so let's try the tuple property for .tensors.

As in the documented API, return `List` for `TransformerData.tensors`
and `FullTransformerBatch.tensors`.
@adrianeboyd (Contributor, PR author)

This would be v1.1 anyway, so maybe it would be better to just go ahead and have .tensors be Tuple instead of List, to make it clear that it shouldn't be edited and to stay closer to the transformers behavior.

@adrianeboyd requested a review from honnibal August 10, 2021 11:20
@honnibal (Member)

I think this looks good. The only thing I'd want to make sure we think about is that the full ModelOutput might include a lot of data, and if the data is on GPU the collection of documents can easily end up consuming all the memory.

The transformers library should let people set what they want stored into the ModelOutput, right? Do we expose that setting conveniently, including exposing it when people have loaded a trained pipeline?
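
For reference, on the plain transformers side the extra outputs are opt-in flags on the model config, roughly along these lines (which flags a given architecture honors varies by model):

from transformers import AutoConfig, AutoModel, AutoTokenizer

name = "distilbert-base-uncased"
config = AutoConfig.from_pretrained(name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, config=config)

inputs = tokenizer("An example sentence.", return_tensors="pt")
outputs = model(**inputs)                # a ModelOutput
assert outputs.attentions is not None   # only present because output_attentions=True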

@adrianeboyd (Contributor, PR author)

None of the additional model output is enabled by default in transformers as far as I've seen, although I suppose that could change and it's kind of out of our control.

As it is now, the settings are fixed once you've loaded the model. I think this is mostly okay, and I'm not sure I'd want to go to a lot of trouble to let you change it for each batch.

I thought you would be able to do this, but you can't override a value that isn't in the config by default:

nlp = spacy.load("model", config={"components": {"transformer": {"model": {"transformer_config": {"output_attentions": True}}}}})

Existing config section:

[components.transformer.model.transformer_config]

The permitted values obviously vary by model, but maybe we need to figure out how to add the permitted ones as false in the saved config, though I'd hope there's an easier solution I'm not immediately seeing.

@adrianeboyd (Contributor, PR author) commented Sep 6, 2021

With thinc v8.0.9, the option above of modifying transformer_config no longer crashes, but it still doesn't work due to the modified serialization, because huggingface_from_pretrained is no longer called with the config settings when the model is loaded.

You can change the settings on-the-fly like this:

nlp = spacy.blank("en")
nlp.add_pipe("transformer", config={"model": {"name": "distilbert-base-uncased", "transformer_config": {"output_attentions": True}}})
nlp.initialize()
assert "attentions" in nlp("test")._.trf_data.model_output

nlp.get_pipe("transformer").model.transformer.config.output_attentions = False
assert "attentions" not in nlp("test")._.trf_data.model_output

At this point, parts of the transformer config are in three unsynced places:

# does not include the `return_dict` addition, is only used on initialization
nlp.config["components"]["transformer"]["model"]["transformer_config"]

# includes all the settings, editing on-the-fly works and these are the settings that matter for serialization
nlp.get_pipe("transformer").model.transformer.config

# the internal saved version of the original config.cfg config + `return_dict`,
# but it's not synced (like all model/cfg settings) and also not used on 
# deserialization (unlike other model/cfg settings)
nlp.get_pipe("transformer").model.transformer_config

This is at least one too many, but I have to look at how to avoid saving the third one. Possibly the first one should be moved into [initialize], although that would break what a lot of people are used to. You can also easily override the second one with the third one on deserialization with AutoConfig, which is similar to what happens on initialization with huggingface_from_pretrained, but this has the effect that your config edits as shown above aren't restored on deserialization. (This is actually kind of what transformers itself does, at least for the tokenizer: settings can be modified on the fly, but they're not saved because the saved config is stored somewhere else in the object. I don't like this as a design choice, though.)

@adrianeboyd (Contributor, PR author) commented Sep 6, 2021

Ah, in terms of f2bd2b6, the equivalent edit on deserialization (rather than overriding forward) would be:

config = AutoConfig.from_pretrained(config_file, **trf_config)

I need to look into the details a bit more...

And as a note, I suspect it will be necessary to re-add return_dict on deserialization for v3, since it doesn't seem to be included in the saved config, but I haven't tested it so I'm not entirely sure yet.

Only use `HFObjects.tokenizer/transformer_config` on init and use the
internal configs from `HFObjects.tokenizer/transformer` during
serialization.

To make this clearer, rename the init-only config dict fields in
`HFObjects` to `init_tokenizer_config` and `init_transformer_config`.
This is necessary for transformers==3.4.0, at least.
@adrianeboyd (Contributor, PR author)

I tried to split the HFShim edits into a separate PR, but the merges were complicated enough due to the output_attentions changes that I decided to just go ahead and keep it here.

(I swear this will work someday soon.)

@adrianeboyd (Contributor, PR author)

So in theory I would prefer to move the transformer model settings into [initialize], but I think that's going to be too difficult a change to make at this point. As it stands, it's not that different from other components that serialize some internal settings as model/cfg in an unsynced way.

To make it clearer that this is entirely internals, prefix init configs
with underscore: `HFObjects._init_tokenizer/transformer_config`
@adrianeboyd (Contributor, PR author)

I think it'd be nicer to move the init configs out of HFObjects entirely, but given the number of layers involved here, it doesn't seem worth the effort at this point if everything works and it's clear (enough) to the user where the settings can be modified.

To handle the case where the pipeline is serialized before the
transformer component is initialized, save the init configs if the model
is not initialized and restore the uninitialized `HFObjects` status.
@adrianeboyd merged commit 9020c79 into explosion:master Sep 9, 2021
@adrianeboyd (Contributor, PR author)

As a note: this wasn't actually working correctly when merged and the culprit looks like it's 4ee6e5a. More to come...

Labels: enhancement (New feature or request), v1.1 (Related to v1.1)