Use ModelOutput instead of tuples #283
Conversation
* Use `BaseModelOutput` instead of a list of tensors for the transformer model output and in `FullTransformerBatch.tensors`.
* For backwards compatibility with transformers v3, set `return_dict = True` in the transformer config.
* Rename `TransformerData.attention` to `TransformerData.attentions` for consistency with `BaseModelOutput`.
For compatibility with transformers v3.x
* Use `albert-base-v2` to reduce model downloads in tests
* Test tagger training with `output_attentions` enabled
* Check for `pooling_output` (unnamed) in `trf_data`
* Only store `last_hidden_state` in `TransformerData.tensors`
* Move all `torch.Tensor` and `tuple(torch.Tensor)` values into `TransformerData.model_output` for cases where `tensor.shape[-2]` is the sequence length, so that it's possible to slice the output for individual docs (see the sketch below)
* Includes: `pooler_output`, `hidden_states`, `attentions`, and `cross_attentions`
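As a rough illustration of what this enables, here is a minimal sketch; the model name matches the one used in the tests, `output_attentions` is just one example setting, and the exact fields available depend on the transformer and its config:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "transformer",
    config={
        "model": {
            "name": "albert-base-v2",
            "transformer_config": {"output_attentions": True},
        }
    },
)
nlp.initialize()

doc = nlp("This is a test.")
trf_data = doc._.trf_data
# The usual activations are still available via trf_data.tensors, while the
# extra outputs enabled in transformer_config end up in trf_data.model_output,
# already sliced to this doc.
assert "attentions" in trf_data.model_output
```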
@KennethEnevoldsen If you have time, could I ask you to take a look at this PR? I start to go cross-eyed after a bit with all the model formats/outputs and the doc splitting, and since the test suite is a bit sparse, it's possible I've missed something that would cause problems for workflows like yours. The main API changes are described above.
I will take a look at it. It will probably be the start of next week though, hope that is fine.
I think most of this will work just fine, maybe with the exception of the zero tensor in an empty `FullTransformerBatch`.

Moving the attention to `model_output` seems reasonable and should also be easier to maintain, although it might be slightly too nested. Here I think the extra annotation setters you mentioned are the better solution in terms of usability.

As you mentioned, it might cause confusion that `FullTransformerBatch.tensors` and `TransformerData.tensors` aren't the same data type (and not a tensor in one case). I think the ideal situation here would be to use `last_hidden_state` and `model_output` (maybe even other features of interest, e.g. `attentions`), and then use `tensors` for backwards compatibility?
Thanks for the feedback! Thinking about it again, maybe both data classes should use `ModelOutput`.
I think that is a reasonable approach. I think the assignment to …
In `TransformerData` and `FullTransformerBatch` use `ModelOutput` rather than a list of tensors as the primary model output data storage. In both classes, `.tensors` is converted to a property that returns `model_output.to_tuple()`.
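A minimal sketch of this pattern, using a hypothetical stand-in dataclass rather than the actual `TransformerData`/`FullTransformerBatch` implementation:

```python
from dataclasses import dataclass
from typing import Tuple

import torch
from transformers.modeling_outputs import BaseModelOutput


@dataclass
class ExampleData:  # hypothetical stand-in, not the real class
    model_output: BaseModelOutput

    @property
    def tensors(self) -> Tuple:
        # backwards-compatible view: the same values as a tuple, in field order
        return self.model_output.to_tuple()


output = BaseModelOutput(last_hidden_state=torch.zeros(1, 3, 8))
data = ExampleData(model_output=output)
assert data.tensors[0] is output.last_hidden_state
```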
Yeah, I think it would be really bad if they diverge, so let's try the tuple property for `.tensors`.
As in the documented API, return `List` for `TransformerData.tensors` and `FullTransformerBatch.tensors`.
This would be v1.1 anyway, so maybe it would be better to just go ahead and have …
I think this looks good. The only thing I'd want to make sure we think about is that the full …
None of the additional model output is enabled by default in `transformer_config`. As it is now, the settings are fixed once you've loaded the model. I think this is mostly okay, and I'm not sure I'd want to go to a lot of trouble to let you change it for each batch.

I thought you would be able to do this, but you can't override a value that's not there in the config by default, so hmm:

```python
nlp = spacy.load("model", config={"components": {"transformer": {"model": {"transformer_config": {"output_attentions": True}}}}})
```

Existing config section:

```
[components.transformer.model.transformer_config]
```

The permitted values obviously vary by model, but maybe we need to figure out how to add the permitted ones as false in the saved config, although I would hope there might be an easier solution I'm not immediately seeing?
With thinc v8.0.9, the option above with modifying the config on load works.

You can change the settings on-the-fly like this:

```python
nlp = spacy.blank("en")
nlp.add_pipe("transformer", config={"model": {"name": "distilbert-base-uncased", "transformer_config": {"output_attentions": True}}})
nlp.initialize()
assert "attentions" in nlp("test")._.trf_data.model_output
nlp.get_pipe("transformer").model.transformer.config.output_attentions = False
assert "attentions" not in nlp("test")._.trf_data.model_output
```

At this point, parts of the transformer config are in three unsynced places:

```python
# does not include the `return_dict` addition, is only used on initialization
nlp.config["components"]["transformer"]["model"]["transformer_config"]

# includes all the settings, editing on-the-fly works, and these are the settings
# that matter for serialization
nlp.get_pipe("transformer").model.transformer.config

# the internal saved version of the original config.cfg config + `return_dict`,
# but it's not synced (like all model/cfg settings) and also not used on
# deserialization (unlike other model/cfg settings)
nlp.get_pipe("transformer").model.transformer_config
```

This is at least one too many, but I have to look at how to avoid saving the third one. Possibly the first one should be moved into …
Ah, in terms of f2bd2b6, the equivalent edit would be this on deserialization rather than overriding the settings afterwards:

```python
config = AutoConfig.from_pretrained(config_file, **trf_config)
```

I need to look into the details a bit more... And as a note, I suspect it will be necessary to re-add …
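For context, a minimal sketch of that idea outside the component; this is hypothetical illustration only, using a hub model name in place of the serialized `config_file`, and the override values shown are just examples:

```python
from transformers import AutoConfig, AutoModel

# Apply the stored transformer_config overrides while loading the config,
# instead of mutating the model's config afterwards.
trf_config = {"output_attentions": True, "return_dict": True}
config = AutoConfig.from_pretrained("distilbert-base-uncased", **trf_config)
model = AutoModel.from_pretrained("distilbert-base-uncased", config=config)
assert model.config.output_attentions is True
```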
Only use `HFObjects.tokenizer/transformer_config` on init and use the internal configs from `HFObjects.tokenizer/transformer` during serialization. To make this clearer, rename the init-only config dict fields in `HFObjects` to `init_tokenizer_config` and `init_transformer_config`.
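A hedged sketch of that split as a hypothetical helper, not the actual `HFShim` code; the attribute names follow the commit message above:

```python
def transformer_config_for_serialization(hf_objects):
    # Once the Hugging Face model exists, its own config is the source of
    # truth; the init-only dict is only relevant before initialization.
    if hf_objects.transformer is not None:
        return hf_objects.transformer.config.to_dict()
    return hf_objects.init_transformer_config
```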
This is necessary for transformers==3.4.0, at least.
I tried to split the … (I swear this will work someday soon.)
So I think I would theoretically prefer to move the transformer model settings into …
To make it clearer that this is entirely internals, prefix init configs with underscore: `HFObjects._init_tokenizer/transformer_config`
I think it'd be nicer to move the init configs out of …
To handle the case where the pipeline is serialized before the transformer component is initialized, save the init configs if the model is not initialized and restore the uninitialized `HFObjects` status.
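The scenario this targets looks roughly like the following; this is a sketch only, the model name is an example, and the exact round-trip behavior is precisely what the fix is about:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("transformer", config={"model": {"name": "distilbert-base-uncased"}})
# Serialize before nlp.initialize(): there is no loaded Hugging Face model yet,
# so only the init configs are available to save and restore.
data = nlp.to_bytes()

nlp2 = spacy.blank("en")
nlp2.add_pipe("transformer", config={"model": {"name": "distilbert-base-uncased"}})
nlp2.from_bytes(data)
nlp2.initialize()
```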
As a note: this wasn't actually working correctly when merged and the culprit looks like it's 4ee6e5a. More to come...
* Save model output as `ModelOutput` instead of a list of tensors in `TransformerData.model_output` and `FullTransformerBatch.model_output`.
  * Set `return_dict = True` in the transformer config.
  * `TransformerData.tensors` and `FullTransformerBatch.tensors` return `ModelOutput.to_tuple()`.
* Store any additional model output as `ModelOutput` in `TransformerData.model_output`.
  * Store `torch.Tensor` and `tuple(torch.Tensor)` values in `TransformerData.model_output` for cases where `tensor.shape[0]` is the batch size so that it's possible to slice the output for individual docs.
  * Includes: `pooler_output`, `hidden_states`, `attentions`, and `cross_attentions`.
* Re-enable tests for `gpt2` and `xlnet` in the CI.
* Following Fixing Transformer IO (attempt 2) #285, include some minor modifications and bug fixes for `HFShim` and `HFObjects`:
  * Keep the init-only configs in `HFObjects` and don't serialize them in `HFShim` once the model is initialized.